Assignment 1: About the project
Here’s the link to the repo on GitHub: https://github.com/oj-lappi/IODS-project
My background and motivations
I’d like to learn about statistical tests and possibly how to add R to data pipelines.
I don’t think R will replace my current tools for analyzing and plotting data, but I do want to at least see how feasible it would be to add R as a tool for developing statistical analyses for experiments, and possibly writing some small programs for quick statistical analysis in postprocessing of experiments.
I’m familiar with git; it’s a daily driver for me. For quick plotting I tend to use gnuplot and Python, and for data analysis utilities I use Python, UNIX tools, or custom C++ programs.
Goals
I’m interested in creating automatic data pipelines that process data in object storage and upload plots to a dashboard. R+Github pages might be a good way to do that, although I would prefer an on-prem solution (HU’s gitlab maybe?) or a self-hosted website.
RStudio/Git integration
I have ssh keys set up on each of my computers, and a personal access token would be less flexible than that.
I can push and pull with this setup from RStudio.
Exercises
I checked the exercises and followed some of them along locally; I think I’m getting the hang of it :).
# Timestamp:
date()
## [1] "Mon Dec 4 21:24:22 2023"
Assignment 2: Linear regression on the Learning2014 dataset
1. Reading the data
#Dependencies
#install.packages(c("readr","lmtest", "dplyr"))
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Timestamp
date()
## [1] "Mon Dec 4 21:24:23 2023"
2. Analysis
lrn2014 <- read_csv('data/learning2014.csv')
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2.1 The shape of the data
## [1] "dimensions: 166" "dimensions: 7"
There are 166 rows with 7 columns each.
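The output above came from a hidden chunk; it could be reproduced with, for example (note that paste() vectorizes over the two elements returned by dim()):

```r
paste("dimensions:", dim(lrn2014))
```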
The spec is:
## cols(
## gender = col_character(),
## age = col_double(),
## attitude = col_double(),
## deep = col_double(),
## stra = col_double(),
## surf = col_double(),
## points = col_double()
## )
2.2 Descriptive statistics
Sex distribution
gender seems to be a categorical variable, so let’s count the number of rows per gender (sex):
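The chunk behind this table is hidden, but it was presumably something along these lines:

```r
lrn2014 %>%
  group_by(gender) %>%     # one group per gender
  summarize(count = n())   # count the rows in each group
```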
## # A tibble: 2 × 2
## gender count
## <chr> <int>
## 1 F 110
## 2 M 56
110 F and 56 M: the dataset is skewed towards female students. Let’s plot that.
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
fig. 1.1, sex distributions
Age distribution
Let’s plot the age distribution of the students.
lrn2014 %>%
ggplot(aes(x = age)) +
stat_count()
fig. 1.2, age distributions
min(lrn2014$age)
## [1] 17
max(lrn2014$age)
## [1] 55
median(lrn2014$age)
## [1] 22
The age range is from 17 to 55, and the median is 22. Visually inspecting the distribution, the mode of the distribution is early twenties, as you would expect, although there is a long tail.
Age-sex distribution
Let’s combine the two columns into a classic population pyramid, or age-sex pyramid.
But that’s not exactly what we want. It turns out a population pyramid is not an out-of-the-box plot we can easily produce; we have to calculate bins manually and do some magic.
lrn2014 %>%
  mutate(age_bin = cut(age, breaks = seq(0, 60, 5))) %>%    # Bin by age
  group_by(gender, age_bin) %>%                             # Group by bin and gender
  summarize(count = n()) %>%                                # Count rows per group
  mutate(count =
    if_else(gender == 'M', -count, count)) %>%              # Negate one side
  ggplot(aes(x = count, y = age_bin)) +
  geom_col(aes(fill = gender))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
fig. 1.3a, population pyramid
There are very few male students under 20; I would speculate that this is due to Finnish army conscription. Otherwise the distribution seems roughly equal on the female and male sides.
We can of course bin by single years instead of five, which gives higher resolution but more noise.
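The chunk for this plot is hidden, but it is presumably the same pipeline as above with one-year breaks, roughly:

```r
lrn2014 %>%
  mutate(age_bin = cut(age, breaks = seq(0, 60, 1))) %>%    # one-year bins
  group_by(gender, age_bin) %>%
  summarize(count = n()) %>%
  mutate(count = if_else(gender == 'M', -count, count)) %>% # Negate one side
  ggplot(aes(x = count, y = age_bin)) +
  geom_col(aes(fill = gender))
```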
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
fig. 1.3b, population pyramid #2, fine-grained bins
There is one peculiar decline in female student participation around ages 26-28, which jumps back after thirty. This might be a maternity effect, but that is highly speculative; there are very few samples in this dataset.
Exam scores
Let’s look at exam scores:
paste("median:", median(lrn2014$points), ", mean:",mean(lrn2014$points), ", standard deviation:", sd(lrn2014$points))
## [1] "median: 23 , mean: 22.7168674698795 , standard deviation: 5.89488364245529"
Let’s look at the full distribution using geom_density.
lrn2014 %>%
ggplot(aes(x=points)) +
geom_density()
fig. 1.4a, exam score distribution
There’s a curious valley in the density at around 12-14 points. Let’s look closer.
lrn2014 %>%
ggplot(aes(x=points, tickwidth=1)) +
geom_histogram(boundary=0,binwidth=1) +
scale_x_continuous(breaks = seq(0, 40, by = 1))
fig. 1.4b, exam score histogram
So no students got 12, 13, or 14 points. The jump from 11 to 15 must be caused by some barrier in the grading. Let’s look at our two demographic variables.
Exploring exam score distributions for different groups
Point means by sex, with age gradient
lrn2014 %>%
group_by(gender) %>%
ggplot(aes(x=points)) +
geom_histogram(boundary=0,binwidth=2, aes(fill=factor(age))) +
facet_wrap(~gender)
fig. 1.5a, exam scores by sex, with age gradient
I see no clear bias either way; perhaps a slight correlation between score and age within the mode (20-30 years), and then no correlation for higher ages. The female distribution seems the most well-behaved, although the gap from 11 points up is much sharper here as well.
Point means by age group, with sex
lrn2014 %>%
mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
group_by(gender, age_bin) %>%
ggplot(aes(x=points)) +
geom_histogram(boundary=0,binwidth=2, aes(fill=gender)) +
facet_wrap(~age_bin)
fig. 1.5b, exam scores by age group, with labeled sex
Point means by age group, box plot
lrn2014 %>%
mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
ggplot(aes(x=age_bin, y = points)) +
stat_boxplot(geom = "errorbar", width = 0.25) +
geom_boxplot()
fig. 1.6a, exam score distributions by age group, box plot sequence
lrn2014 %>%
mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
ggplot(aes(x=age_bin, y = points)) +
stat_boxplot(geom = "errorbar", width = 0.25) +
geom_boxplot()+
facet_wrap(~gender)
fig. 1.6b, exam score distributions by age-sex group, box plot sequence
I see no correlation between age and exam score in these plots.
2.3 Correlation matrix
Finally, let us look at all the survey question scores together with the already explored variables in a correlation matrix.
lrn2014 %>%
select(gender, age, surf, stra, deep, attitude, points) %>%
ggpairs(aes(fill=gender, color=gender),lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 1.7, correlation matrix of all variables
There seem to be negative correlations between:
- surf and deep (but seemingly only strongly for male students)
- attitude and deep (weak, but stronger for male students again)
- stra and surf (weak)

And a positive correlation between points and attitude! This is the strongest linear relationship in the data. And we can verify that age does not seem to have an effect at all.
There also seems to be a relationship between attitude and gender, but no relationship between gender and points.
3. Regression
Based on the data exploration, attitude seems the most likely candidate, and next after that: stra and surf.
3.1 Simple regression as baseline
Let’s start with a simple regression model of points ~ attitude, which will be our baseline.
attitude_model <- lrn2014 %>%
lm(points ~ attitude, data = .)
summary(attitude_model)
##
## Call:
## lm(formula = points ~ attitude, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
There is clearly a statistically significant relationship between attitude and points. But the adjusted R-squared is only around 18.5%, so there is a lot of variance not explained by the model.
3.2 Multiple regression, 3 variables
three_var_model <- lrn2014 %>%
lm(points ~ attitude + stra + surf, data = .)
summary(three_var_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The adjusted R-squared is higher, which means our model captures more of the underlying variation than before, although still below 20%.
It seems that the relationship between points and surf is not statistically significant. Let’s drop surf and keep stra, and try again.
3.3 Multiple regression, 2 variables
two_var_model <- lrn2014 %>%
lm(points ~ attitude + stra, data = .)
summary(two_var_model)
##
## Call:
## lm(formula = points ~ attitude + stra, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Right, this is our best fit yet, based on the adjusted R-squared. Depending on our choice of significance level in a hypothesis test, the stra term might be ignored: at the standard level of a = 0.05, we cannot reject the null hypothesis that stra has no linear effect on points.
Let’s test another model, with nothing but stra.
3.4 Simple regression with only stra
stra_model <- lrn2014 %>%
lm(points ~ stra, data = .)
summary(stra_model)
##
## Call:
## lm(formula = points ~ stra, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5581 -3.8198 0.1042 4.3024 10.1394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.233 1.897 10.141 <2e-16 ***
## stra 1.116 0.590 1.892 0.0603 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.849 on 164 degrees of freedom
## Multiple R-squared: 0.02135, Adjusted R-squared: 0.01538
## F-statistic: 3.578 on 1 and 164 DF, p-value: 0.06031
We get very close to a statistically significant result at a standard significance level of 0.05, but not quite. Let’s drop stra, since that’s what the assignment asked us to do.
The model we will be using is therefore the baseline model with one predictor: attitude.
4. Interpretation
4.1 Summary
Let’s rerun that summary:
summary(attitude_model)
##
## Call:
## lm(formula = points ~ attitude, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The fitted regression coefficients are:
- intercept: 11.6372
- attitude: 3.5255

This means that the model predicts the conditional mean of the exam score, given an attitude value, as:
points = 11.6372 + 3.5255*attitude
If we multiply the attitude coefficient with the range of attitude in the population, we can get an idea of how the model assigns expected exam scores based on a student’s attitude.
am <- mean(lrn2014$attitude)
as <- sd(lrn2014$attitude)
print("Range of predictor term within a sd:")
## [1] "Range of predictor term within a sd:"
3.5255*c(am-as, am, am+as)
## [1] 8.506549 11.079839 13.653130
print("Range of ŷ_i within a sd:")
## [1] "Range of ŷ_i within a sd:"
11.6372 + 3.5255*c(am-as, am, am+as)
## [1] 20.14375 22.71704 25.29033
print("Range of ŷ_i within two sd's:")
## [1] "Range of ŷ_i within two sd's:"
11.6372 + 3.5255*c(am-2*as, am, am+2*as)
## [1] 17.57046 22.71704 27.86362
So, assuming attitude is Gaussian and that the sample stddev is a good estimate, the model assigns an exam score:
- between 20.14 and 25.29 to a majority of students (about 68%, since ±1 stddev in a Gaussian captures 68.2% of the population)
- between 17.57 and 27.86 to a super-majority (about 95%) of students

If we look back at the exam score distribution in figure 1.4, this does capture the mode of the distribution.
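Equivalently, we could let R do the arithmetic with predict(); a sketch, where the attitude values are made up purely for illustration:

```r
# hypothetical attitude values, chosen only to illustrate predict()
new_students <- data.frame(attitude = c(2.5, 3.1, 4.0))
predict(attitude_model, newdata = new_students)
```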
4.2 Multiple R-squared
The multiple R-squared value is 0.1906; the standard way to express this is that “19% of the variation in exam scores is explained by the attitude variable” (see MABS4IODS, 3.2.1).
I would interpret this to mean that, using the linear model, given attitude, we estimate the expected squared error of the prediction to be roughly 80% of what it would be when simply using the sample mean as a predictor. The estimation assumes the sample is representative, because we’re using residuals to get this number.
I’m not quite sure if this more elaborate interpretation is exact, but it’s what I was able to piece together from sources online, mostly Wikipedia (https://en.wikipedia.org/wiki/Coefficient_of_determination).
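One way to sanity-check this is to compute R-squared directly from the residuals, since R² = 1 − RSS/TSS for a model with an intercept:

```r
rss <- sum(residuals(attitude_model)^2)                # residual sum of squares
tss <- sum((lrn2014$points - mean(lrn2014$points))^2)  # total sum of squares
1 - rss / tss  # should match the multiple R-squared, ~0.1906
```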
5. Regression diagnostics
par(mfrow=c(2,2))
plot(attitude_model)
fig. 2.0, regression diagnostics
par(mfrow=c(1,1))
5.1 Assumptions
The assumptions of the model are:
1. the constant variance assumption (homoscedasticity)
2. the normality assumption (normally distributed errors)
3. a linear relationship between the predictor and the response
The Residuals vs. Leverage plot
The Residuals vs. Leverage plot checks that there aren’t any influential outliers affecting the fit of the regression coefficients. The plot has a dashed line showing a critical Cook’s distance value; in our case this dashed line is not visible. Essentially, if a point is very far right and has a high standardized residual (off the central line), it is a highly influential point and will have to be looked into.
A highly influential point may be an indication of a point that should be excluded, but this has to be done on a case-by-case basis. It might also mean that assumption 3 is violated, that there isn’t a linear relationship between predictors and the response.
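The influential points can also be inspected numerically; a sketch using Cook’s distances with the common 4/n rule of thumb as a cutoff (the cutoff choice is an assumption, not something from the course material):

```r
cd <- cooks.distance(attitude_model)
# flag observations whose Cook's distance exceeds the rule-of-thumb cutoff 4/n
which(cd > 4 / length(cd))
```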
The Residuals vs. Fitted plot
The Residuals vs. Fitted plot (and the Scale-Location plot) gives us a visual way to check for heteroscedasticity (a violation of assumption 1). If the red mean line is not horizontal, the residuals have a bias in some region of the fitted values (the variance is not constant).
I don’t think this means the data is heteroscedastic; it certainly doesn’t look like it is. But I’m not so familiar with these visual checks, so I searched the web for a homoscedasticity test and found the Breusch-Pagan test in the lmtest library.
attitude_model_bptest <- bptest(attitude_model)
attitude_model_bptest
##
## studentized Breusch-Pagan test
##
## data: attitude_model
## BP = 0.0039163, df = 1, p-value = 0.9501
A p-value of 0.95 means that there is very little evidence that the model is heteroscedastic. Ok, good! The trend would have to be much clearer than that, then.
Normal QQ-plot
This plot compares the “ideal” theoretical quantiles to the sample quantiles. It is used to test for normality by comparing the residual distribution against a theoretical normal distribution.
There are some outliers at the bottom left, which may indicate a bimodal distribution (remember that the exam scores looked bimodal as well, fig 1.4). The plot also curves down a little at the upper end, which I believe usually indicates a light left skew.
Let’s test the residuals for normality, using the Shapiro-Wilk test:
shapiro.test(attitude_model[["residuals"]])
##
## Shapiro-Wilk normality test
##
## data: attitude_model[["residuals"]]
## W = 0.97394, p-value = 0.003195
The p-value is quite small, 0.003, so we have to reject the null hypothesis that the residuals are normally distributed.
Hmm, maybe the issue is the grading scale? I’ve tried to fix this, and I can get the QQ-plot to look nicer with some transformations, but they make the model test statistics worse. This is the best I can do for now.
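One such attempt might look like the following sketch (squaring the response is one arbitrary choice among many, not necessarily the transformation tried above):

```r
# transform the response, refit, and re-test the residuals for normality
transformed_model <- lm(points^2 ~ attitude, data = lrn2014)
shapiro.test(residuals(transformed_model))
```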
Assignment 3: Logistic regression
1. Get the data
library(dplyr)
library(readr)
data_url <- "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"
df <- read_csv(data_url)
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2. Summary of variables
The variables in the data set are:
colnames(df)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The dataset contains student school performance data averaged over two courses, Portuguese and mathematics (the variables are: G1, G2, G3, absences, failures, paid). To be exact, the “paid” column is only from one of the courses. The other variables are demographic and social; the social variables were collected using surveys.
Two variables, alc_use and high_use, are transformations of the original dataset added for this assignment. alc_use is the average alcohol use per day, combining self-reported weekday and weekend usage on a scale of 1 to 5. high_use is a boolean indicating whether alc_use is greater than 2.
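Based on that description, the two columns were presumably derived from the Dalc (weekday) and Walc (weekend) columns roughly as follows (a sketch, not the exact preprocessing script):

```r
df <- df %>%
  mutate(alc_use = (Dalc + Walc) / 2,  # mean of weekday and weekend use
         high_use = alc_use > 2)       # boolean for high consumption
```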
3. Choose 4 interesting variables in the data to study in relation to alcohol use
I choose:
- Fedu: Father’s education level
- Medu: Mother’s education level
- absences: number of absences on average per course
- G3: final course grade average
My hypotheses are:
- parents’ education levels (Fedu, Medu) have a small but significant impact on the likelihood of the high_use variable.
- an increase in absences increases the likelihood of high alcohol usage
- a decrease in grades increases the likelihood of high alcohol usage
4. Explore variables
Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses. (0-5 points)
library(ggplot2)
library(GGally)
df %>% select(alc_use, high_use, Fedu, Medu, absences, G3) %>%
ggpairs(aes(fill=high_use, color=high_use))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
fig. 1.1, correlation matrix of variables, colored by high_use value
- It turns out that the dataset shows absolutely no correlation between parents’ education and alcohol usage.
- Fedu is very highly correlated with Medu though, which is not surprising but interesting to note; people seem to marry within their social class.
- absences has a very high correlation with alcohol use
- grades have a slightly smaller negative correlation with alcohol use
Two out of four hypotheses seem to have some support in the data after this exploration.
5. Logistic regression
Let’s do a logistic regression model using the two statistically significant correlations: G3 and absences.
model <- glm(high_use ~ G3 + absences, data = df, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ G3 + absences, family = "binomial",
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38732 0.43500 -0.890 0.373259
## G3 -0.07606 0.03553 -2.141 0.032311 *
## absences 0.08423 0.02309 3.648 0.000264 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 429.93 on 367 degrees of freedom
## AIC: 435.93
##
## Number of Fisher Scoring iterations: 4
Both G3 and absences have a p-value of less than 0.05 for the Wald tests of the fit, but we have a better tool that’s easier to interpret: the confidence intervals from the model.
5.1 Odds ratio
coeffs <- coefficients(model)
odds_ratio <- coeffs %>% exp
odds_ratio
## (Intercept) G3 absences
## 0.6788751 0.9267616 1.0878762
The odds ratios for G3 and absences are roughly 0.926 and 1.088. For G3, the OR is less than one because the correlation between the variables is negative.
The way to interpret these is:
- for each unit decrease in final grade average, the odds that a student has high alcohol use increase by roughly 8.0% (1/0.9267 ≈ 1.08)
- for each additional absence per course, the odds that a student has high alcohol use increase by roughly 8.8%

But that’s just the average effect the model has fit; let’s look at confidence intervals for the odds ratios.
5.2 Confidence intervals
ci <- confint(model) %>% exp
## Waiting for profiling to be done...
ci
## 2.5 % 97.5 %
## (Intercept) 0.2857593 1.5832545
## G3 0.8639686 0.9935498
## absences 1.0419618 1.1407755
The 95% confidence intervals are both entirely on one side of unity, so we can say with 95% confidence that there is an effect for both variables, and the effect increases the odds by:
- a factor in the range [1.007, 1.16] for each unit decrease in final grade average (again inverting the odds ratio because the effect is negative)
- a factor in the range [1.04, 1.14] for each additional absence per course
5.3 Interpretation
There is an effect, although the final grade effect comes very close to 1 at the low end of the confidence interval; the increase in odds per unit is only 0.7%! For absences, the confidence interval is a little tighter, but it still seems a little wide to me for practical use. We would need more samples to get a tighter interval.
The two hypotheses that survived the initial exploration both match the outcomes of the logistic regression, and now we can quantify the quality of an estimate of the effect.
6. Predictions
Let’s test out the model with a confusion matrix.
probabilities <- predict(model, type = "response")
df <- df %>%
mutate(probability = probabilities) %>%
mutate(prediction = probability > 0.5)
table(high_use = df$high_use, prediction = df$prediction) %>%
prop.table %>%
addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68108108 0.01891892 0.70000000
## TRUE 0.25675676 0.04324324 0.30000000
## Sum 0.93783784 0.06216216 1.00000000
The confusion matrix tells us that the model has a false positive rate of ~2% and a false negative rate of ~26%. That’s pretty good! High false negative rates are not so bad; they are just missed cases. A high false positive rate would mean that the model is unreliable and cannot be used as an indicator (in this case; the importance of different error types depends on the specific use case and the meaning of negative and positive).
On the other hand, the confusion matrix also tells us that the model predicts 94% of all students to have non-high alcohol use, while in reality the number is 70%. So the model achieves this relatively safe indicator status by being rather conservative.
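The overall training misclassification rate can be read off the matrix (the sum of the off-diagonal cells), or computed directly:

```r
# proportion of students the model misclassifies (on the training data)
mean(df$high_use != df$prediction)  # ~0.28
```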
Caveat
We haven’t done a split into a model fit dataset and a validation set, so this confusion matrix is of limited value.
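A minimal sketch of what such a split could look like (the 80/20 ratio and the seed here are arbitrary choices, not part of the assignment):

```r
set.seed(42)  # arbitrary seed, for reproducibility
train_idx <- sample(nrow(df), size = round(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]
# refit on the training set only, then evaluate on held-out data
split_model <- glm(high_use ~ G3 + absences, data = train, family = "binomial")
test_pred <- predict(split_model, newdata = test, type = "response") > 0.5
table(high_use = test$high_use, prediction = test_pred)
```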
Assignment 4: Classification and Clustering
1. Overview
This assignment is about classification and clustering. We’re looking at a dataset put together from census data, and seeing how the crime rate varies across multiple correlated variables.
The methods used are linear discriminant analysis (LDA) and k-means clustering.
2. The dataset
Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here (https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html). (0-1 points)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
Dataset structure
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
There are 506 rows with 14 columns. The dataset seems to originally have been put together to analyze housing values across the suburbs. From the paper cited on the stat.ethz.ch site: “Harrison, D. and Rubinfeld, D.L. (1978) Hedonic housing prices and the demand for clean air”, we can find that a row represents a “SMSA census tract”, so, a census area in Boston containing some number of housing units.
The columns contain social statistics related to these census areas (e.g. crim = crime rate, ptratio = pupil-teacher ratio), data about the housing units in the area (rm = avg # of rooms per unit, medv = median housing unit value, age = prop. houses built before 1940), and data about the location of the area (e.g. dis = weighted mean of distances to employment centers, chas = 1 if by Charles River, 0 if not by Charles River).
Some of the columns are a little counter-intuitive or difficult to interpret. E.g., the column age is the proportion of houses built before 1940, and the column lstat is the proportion of the population that is “lower status”. From the Harrison & Rubinfeld paper, lower status means: “1/2 * (proportion of adults without some high school education and proportion of male workers classified as laborers)”.
Ok, before we move forward, we did see a small issue here; let’s change chas from an integer to a boolean.
library(dplyr)
Boston_explore <- Boston %>% mutate(chas = chas == 1)
str(Boston_explore)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Summary
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
summary(Boston_explore)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Mode :logical
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 FALSE:471
## Median : 0.25651 Median : 0.00 Median : 9.69 TRUE :35
## Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Looking at this summary, all columns are definitely not created equal. I am mostly looking at the difference between the mean and median, which, if they differ by much, can indicate that a variable is not normally distributed. Some of the columns are close to normal distributions, but e.g. zn has a median of 0 and a mean of 11.36. Other highly skewed columns are rad, crim, tax, chas, and black.
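The mean-median gap can also be computed for all columns at once, as a rough skew indicator (as.numeric is only there to handle the logical chas column):

```r
# positive values suggest right skew, values near zero suggest symmetry
sapply(Boston_explore, function(x) mean(as.numeric(x)) - median(as.numeric(x)))
```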
3. Graphical summary
With ggpairs
library(ggplot2)
library(GGally)
Boston_explore %>%
ggpairs(lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 3.1, Correlation matrix
Ok, lots to unpack here. Let’s start with a visual check of each variable’s distribution (the diagonal). Almost none of the columns look normally distributed, with perhaps the exception of rm, the number of rooms.
There are lots of interesting correlations. Just looking at the scatter plots, the three variables rm, medv, and lstat seem to have strikingly strong relationships with each other, which makes sense to me.
The rad variable, an "index of accessibility to radial highways", is clearly bimodal, or one could even say a split distribution. A subset of areas has a much higher index than the others, and in the scatter plots this clearly visible line of the higher-index population seems to consistently cover different ranges of the other variable than the lower-index population does. The effect is most clearly noticeable in the crim, tax, nox, dis and black scatter plots.
dis and nox also have a strikingly
logarithmic-looking relationship.
In general, nearly every variable seems to be correlated with every
other variable, excepting the chas (area is by the Charles
river) column.
With corrplot
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(Boston_explore), method="circle")
fig. 3.2, Correlation matrix with corrplot
Using corrplot, we lose some information, but get a better overview of where the correlations are strongest.
We see strong correlations (large balls) between:
- dis and (zn, indus, nox, and age)
- tax and rad; this is a very strong correlation, they seem to capture much of the same variation
- tax and (crim, indus, and nox), and ditto for rad
- lstat and medv
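For my own reference, the coefficient that corrplot draws is plain Pearson correlation, which is easy to sketch in stdlib Python (illustrative data, not the Boston columns):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linear pairs correlate at +1 or -1; noise pulls |r| toward 0.
r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```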
4. Standardize and summarize
Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)
Let’s run the scale function on the dataset.
Boston_scaled <- as.data.frame(scale(Boston_explore))
summary(Boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We’ve now normalized the columns by subtracting the mean and dividing by the standard deviation such that, if they were normally distributed, they would now be standard normally distributed.
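As a sanity check on what scale() does, here is a single-column sketch in plain Python (illustrative values; R's scale() does the same to each column, using the sample standard deviation):

```python
import statistics

def standardize(xs):
    """Center to mean 0 and rescale to standard deviation 1,
    like R's scale() applied to a single column."""
    mu = statistics.mean(xs)
    sd = statistics.stdev(xs)  # sample sd (n - 1 denominator), as in R
    return [(x - mu) / sd for x in xs]

col = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
z = standardize(col)  # now mean 0, sd 1
```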
Create a categorical crime rate column
# Create quartile class
Boston_scaled$crim <- as.numeric(Boston_scaled$crim)
Boston_scaled$crime <- cut(Boston_scaled$crim, breaks = quantile(Boston_scaled$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# Drop crim
Boston_scaled <- dplyr::select(Boston_scaled, -crim)
We’ve split the crime rate column into a categorical variable recording which quartile of the crime rate distribution each census area falls in.
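The same binning is easy to sketch in plain Python; note that statistics.quantiles uses a different interpolation rule than R's default quantile(), so the break points can differ slightly on real data (illustrative values below):

```python
import statistics

def quartile_label(x, q1, q2, q3):
    """Assign a quartile class, closed on the right like R's cut()."""
    if x <= q1:
        return "low"
    if x <= q2:
        return "med_low"
    if x <= q3:
        return "med_high"
    return "high"

values = [1, 2, 3, 4, 5, 6, 7, 8]               # made-up crime rates
q1, q2, q3 = statistics.quantiles(values, n=4)  # three interior break points
labels = [quartile_label(v, q1, q2, q3) for v in values]
# Each class ends up with a quarter of the observations.
```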
Create training and test datasets
set.seed(179693716)
n <- nrow(Boston_scaled)
split <- sample(n, size = n * 0.8)
train <- Boston_scaled[split,]
test <- Boston_scaled[-split,]
Now we’ve split the dataset into two: 80% of rows are in the training set, 20% are in the test set.
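For comparison with my usual tools, an 80/20 split sketched in plain Python (a shuffle-and-cut instead of R's sample(); with 506 rows it yields the same 404/102 split as above):

```python
import random

def train_test_split(rows, train_frac=0.8, seed=179693716):
    """Shuffle the row indices once, then cut: the first 80% of the
    shuffled order is the training set, the remainder the test set."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(rows) * train_frac)
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test

rows = list(range(506))  # Boston has 506 census tracts
train, test = train_test_split(rows)  # 404 and 102 rows respectively
```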
5. Linear discriminant analysis
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)
library(MASS) # lda() comes from the MASS package
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2475248 0.2351485 0.2623762
##
## Group means:
## zn indus chas nox rm age
## low 0.85485581 -0.8956178 -0.08120770 -0.8436157 0.4413548 -0.8603410
## med_low -0.09212737 -0.2763238 0.04263895 -0.5633101 -0.1877614 -0.3785908
## med_high -0.38704309 0.1791469 0.26643202 0.3710782 0.1054813 0.4474274
## high -0.48724019 1.0170298 -0.04947434 1.0820880 -0.4230980 0.8279971
## dis rad tax ptratio black lstat
## low 0.8290333 -0.6897369 -0.7642825 -0.43374243 0.3887547 -0.77819536
## med_low 0.3902633 -0.5443053 -0.4238660 -0.08385135 0.3219258 -0.11887074
## med_high -0.3933098 -0.4088466 -0.3025121 -0.20555126 0.1295279 0.06516648
## high -0.8564037 1.6390172 1.5146914 0.78181164 -0.8395683 0.90684062
## medv
## low 0.52171440
## med_low -0.04934231
## med_high 0.14052867
## high -0.73164690
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12991855 0.717286290 -1.11409479
## indus 0.01318562 -0.275547338 0.12479393
## chas -0.08510193 -0.049959856 0.17108503
## nox 0.50033989 -0.751262885 -1.27531579
## rm -0.10015731 -0.108297341 -0.19190232
## age 0.21057791 -0.358696692 -0.14782657
## dis -0.05526417 -0.347261139 0.38293342
## rad 3.16593191 1.032967549 -0.36643042
## tax -0.01168751 -0.096843332 1.10072573
## ptratio 0.12524469 0.004459516 -0.41009220
## black -0.12747580 -0.014860435 0.07573253
## lstat 0.22137355 -0.316638242 0.31488347
## medv 0.17976929 -0.444140572 -0.22238434
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9509 0.0340 0.0151
Biplot
arrows <- function(x, scale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  # Draw variable-loading arrows on an LDA plot.
  # Note: this shadows graphics::arrows, hence the explicit namespace below.
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0,
                   x1 = scale * heads[, choices[1]],
                   y1 = scale * heads[, choices[2]],
                   col = color, length = arrow_heads)
  text(scale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
arrows(lda.fit, scale = 1.5, color = "#ee8855")
fig. 5.1 LDA biplot
We see that of the two first linear discriminants, LD1 nearly perfectly separates the data into two clusters: those with a high crime rate, and those with other values. rad has clearly the highest coefficient in LD1, which can be seen both from the biplot and the LDA summary.
LD2 seems to find another axis within the data that explains a smaller effect. The largest coefficients in LD2 belong to nox, medv, and zn.
6. Validation in test set
# Drop the result variables
facit <- test$crime
test <- dplyr::select(test, -crime)
Let’s predict the crime rate quartiles in the test set and cross tabulate:
# Predict classes in the test data
lda.pred <- predict(lda.fit, newdata = test)
# Do a confusion matrix
tab <- table(correct = facit, predicted = lda.pred$class)
tab
## predicted
## correct low med_low med_high high
## low 15 9 0 0
## med_low 5 13 8 0
## med_high 1 10 19 1
## high 0 0 0 21
nrow(test)
## [1] 102
and here’s the same table as a confusion matrix:
image(tab)
fig. 6.1 Confusion matrix of the LDA fit
The confusion matrix shows that the model has found an axis that aligns very well with the crime rate quartile category. Most predictions are correct (68/102), the second most common case is being off by one class (33/102) and then off by two classes (1/102).
68/102 ~= 67% is not perfect, but it is a lot better than choosing at random, which should give us a correct prediction 25% of the time. It also looks like the model makes a decent predictor of whether an area has a high or non-high crime rate (the lower left of the matrix vs. the top right); that predictor would have a correct classification rate of 101/102, nearly 100%!
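Just to double-check my arithmetic, here are the quoted rates recomputed from the confusion matrix above in plain Python:

```python
# Confusion matrix from above: rows = correct class, columns = predicted,
# both ordered low, med_low, med_high, high.
tab = [
    [15,  9,  0,  0],
    [ 5, 13,  8,  0],
    [ 1, 10, 19,  1],
    [ 0,  0,  0, 21],
]

total = sum(sum(row) for row in tab)
correct = sum(tab[i][i] for i in range(4))  # diagonal: exact class
off_by_one = sum(tab[i][j] for i in range(4) for j in range(4)
                 if abs(i - j) == 1)        # adjacent class
accuracy = correct / total                  # 68/102, about 0.67
```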
6. Extra: let’s try with only rad
lda.radfit <- lda(crime ~ rad, data = train)
lda.radpred <- predict(lda.radfit, newdata = test)
radtab <- table(correct = facit, predicted = lda.radpred$class)
image(radtab)
fig. 6.2 Confusion matrix with a single variable LDA fit
Using only rad gives us a model that seems to have exactly the same predictive power in the high vs. not-high case, but loses information in the lower quartiles. This fits with our earlier analysis: LD1 was mostly rad, and it was able to carve out most of the high crime rate areas from the rest.
7. K-means
The analysis so far suggests there are at least two clear clusters in the data, so we could just choose k = 2, but let’s check with the total within-cluster sum of squares (twcss) what a good choice for k would be.
I will take five samples, average them, and plot the standard deviation of the twcss for each k as error bars; this should give us a more reliable plot than a single sample would.
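As an aside, the tot.withinss quantity that R's kmeans() reports is just the sum of squared distances from each point to its cluster center. A minimal 1-D Lloyd's-algorithm sketch in plain Python (toy data, not the Boston set) shows the elbow logic: twcss collapses once k matches the number of real clusters:

```python
import random

def twcss(xs, centers):
    """Total within-cluster sum of squares for the given centers."""
    return sum(min((x - c) ** 2 for c in centers) for x in xs)

def kmeans_1d(xs, k, seed, iters=25):
    """Minimal Lloyd's algorithm on 1-D data; returns the twcss,
    the quantity R reports as kmeans()$tot.withinss."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            clusters[min(range(k), key=lambda i: (x - centers[i]) ** 2)].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return twcss(xs, centers)

# Two obvious clusters: twcss drops sharply from k = 1 to k = 2.
data = [0.9, 1.0, 1.1, 1.2, 9.0, 9.1, 9.2, 9.3]
w1 = kmeans_1d(data, 1, seed=42)  # one center at the overall mean
w2 = kmeans_1d(data, 2, seed=42)  # one center per true cluster
```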
# determine the number of clusters
#k_max <- 10
# calculate the total within sum of squares, take 5 samples to stabilize the variance
twcss1 <- sapply(1:10, function(k){set.seed(100); kmeans(Boston, k)$tot.withinss})
twcss2 <- sapply(1:10, function(k){set.seed(123); kmeans(Boston, k)$tot.withinss})
twcss3 <- sapply(1:10, function(k){set.seed(321); kmeans(Boston, k)$tot.withinss})
twcss4 <- sapply(1:10, function(k){set.seed(130); kmeans(Boston, k)$tot.withinss})
twcss5 <- sapply(1:10, function(k){set.seed(949); kmeans(Boston, k)$tot.withinss})
df <- as.data.frame(tibble(twcss1,twcss2,twcss3,twcss4,twcss5, k= seq(1,10)))
df <- df %>% rowwise() %>% mutate(twcss = mean(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df <- df %>% rowwise() %>% mutate(twcss_var = var(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df %>% ggplot(aes(x=k, y=twcss)) +
geom_line() +
geom_errorbar(aes(ymin=twcss-sqrt(twcss_var),ymax=twcss+sqrt(twcss_var),color="red"))+
theme(legend.position="none") +
scale_x_continuous(breaks=df$k) +
scale_y_continuous(breaks=seq(1000000,10000000,2000000))
fig. 7.1 k-means twcss plot
It does look like the plot agrees that k = 2 gives a very good fit. The k-means algorithm seems to always find the same clusters here: the error bars collapse to zero width, indicating that the twcss is effectively constant across runs.
k = 3 is also a potential choice, although less clear to me. Beyond that, however, the variance increases greatly and the twcss delta starts giving minimal returns, which indicates that there isn’t a clear structure in the data to guide the clustering.
I will go with k=2, as that seems to match what we saw in the data earlier, and the clusters are very stable.
Boston_kmeans <- Boston %>% scale %>% as.data.frame
set.seed(101)
# k-means clustering
k2m <- kmeans(Boston_kmeans, centers = 2)
summary(k2m)
## Length Class Mode
## cluster 506 -none- numeric
## centers 28 -none- numeric
## totss 1 -none- numeric
## withinss 2 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 2 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
k2m$centers
## crim zn indus chas nox rm
## 1 -0.3894158 0.2621323 -0.6146857 0.002908943 -0.5823397 0.2446705
## 2 0.7238295 -0.4872402 1.1425514 -0.005407018 1.0824279 -0.4547830
## age dis rad tax ptratio black lstat
## 1 -0.4331555 0.4540421 -0.5828749 -0.6291043 -0.2943707 0.3282754 -0.4530491
## 2 0.8051309 -0.8439539 1.0834228 1.1693521 0.5471636 -0.6101842 0.8421083
## medv
## 1 0.3532917
## 2 -0.6566834
It seems that even this k = 2 clustering has found the high crime rate cluster. Let’s confirm that with a visualization.
ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k2m$cluster), fill=factor(k2m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero
## (the same warning is printed nine times in total)
fig. 7.2 k-means pairs plot, k = 2
It seems that the model has picked two clusters that have the following relative position to each other:
- the red cluster has a much lower crime rate
- the red cluster has a lower radial highway access index
- the red cluster has a lower tax rate
- the red cluster has a much higher proportion of black residents
- the red cluster has a lower proportion of buildings built prior to 1940
- the red cluster has a similar median value distribution, shifted towards a higher evaluation (excepting a blue bump right at the top of the evaluation range)
- the blue cluster has a much smaller distance to employment centers
- the blue cluster has a much smaller proportion of residential land zoned for large plots
- the red cluster has a higher average number of rooms per dwelling (rm), i.e. larger housing units
- the red cluster has a smaller proportion of working class people and non-educated adults
So it seems we have found a blue cluster with a lot of business activity (high tax rate, close to employment centers), with access to arterial highways, and high density building (lower number of rooms, zoned for smaller plots) and a red cluster of areas with less business activity, in relatively quiet regions with longer commutes and more working class or non-educated people, and a much higher proportion of black residents.
So it seems like the red regions are newer developments, new suburbs farther away from the city, and there may be some price discrimination in the house prices (medv) connected with the high proportion of black residents living there. You can see the effects of segregationist US housing policy in the data. E.g., the 1949 Housing Act set up a framework to subsidize public housing for whites with clauses forbidding resale to black people, which meant that black people paid more for housing (see e.g. Abramovitz & Smith, "The Persistence of Residential Segregation by Race, 1940 to 2010: The Role of Federal Housing Policy", Families in Society, Vol. 102, Issue 1, for more).
7. Extra: let’s try with k = 3
Why not try with k=3 for good measure? Maybe we can find additional structure in the data.
set.seed(124293)
k3m <- kmeans(Boston_kmeans, centers = 3)
summary(k3m)
## Length Class Mode
## cluster 506 -none- numeric
## centers 42 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
k3m$centers
## crim zn indus chas nox rm age
## 1 -0.4076669 1.1526549 -0.9877755 -0.10115080 -0.9634859 0.7739125 -1.1241828
## 2 -0.3688324 -0.3935457 -0.1369208 0.07398993 -0.1662087 -0.1700456 0.1673019
## 3 0.8942488 -0.4872402 1.0913679 -0.01330932 1.1109351 -0.4609873 0.7828949
## dis rad tax ptratio black lstat
## 1 1.05650031 -0.5965522 -0.6837494 -0.62478941 0.3607235 -0.90904433
## 2 -0.07766431 -0.5799077 -0.5409630 -0.04596655 0.2680397 -0.05818052
## 3 -0.84882600 1.3656860 1.3895093 0.63256391 -0.7083974 0.90799414
## medv
## 1 0.84137443
## 2 -0.04811607
## 3 -0.69550394
ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k3m$cluster), fill=factor(k3m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero
## (the same warning is printed nine times in total)
fig. 7.3 k-means pairs plot, k = 3
Here we see that the blue cluster in this plot is roughly the same as the blue cluster from the k=2 clustering. The k=2 red cluster has here been split into red and green.
The differences between the red and green clusters seem to be:
- the red cluster has higher values of zn, i.e. more big plots
- the red cluster has a smaller proportion of older buildings
- the red cluster consists of exclusively black neighbourhoods, while the green cluster has some spread
- the green cluster is between the red and blue clusters when it comes to the proportion of laborers and uneducated adults
Maybe the red cluster matches better with those black neighbourhoods built more recently, which the 1949 Housing Act and the Federal Housing Authority regulations apply to? I don’t know for sure, more analysis would be required.
Assignment 5: Dimensionality reduction
Brief from moodle
Actually, a fairly large selection of statistical methods can be listed under the title “dimensionality reduction techniques”. Most often (nearly always, that is!) the real-world phenomena are multidimensional: they may consist of not just two or three but 5 or 10 or 20 or 50 (or more) dimensions. Of course, we are living only in a three-dimensional (3D) world, so those multiple dimensions may really challenge our imagination. It would be easier to reduce the number of dimensions in one way or another.
We shall now learn the basics of two data science based ways of reducing the dimensions. The principal method here is principal component analysis (PCA), which reduces any number of measured (continuous) and correlated variables into a few uncorrelated components that collect together as much variance as possible from the original variables. The most important components can be then used for various purposes, e.g., drawing scatterplots and other fancy graphs that would be quite impossible to achieve with the original variables and too many dimensions.
Multiple correspondence analysis (MCA) and other variations of CA bring us similar possibilities in the world of discrete variables, even nominal scale (classified) variables, by finding a suitable transformation into continuous scales and then reducing the dimensions quite analogously with the PCA. The typical graphs show the original classes of the discrete variables on the same “map”, making it possible to reveal connections (correspondences) between different things that would be quite impossible to see from the corresponding cross tables (too many numbers!).
Briefly stated, these methods help to visualize and understand multidimensional phenomena by reducing their dimensionality that may first feel impossible to handle at all.
0. Setup, read the data
library(readr)
library(tibble)
library(GGally)
library(corrplot)
library(dplyr)
library(FactoMineR)
human0 <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv", show_col_types = FALSE)
1. Country names as rownames + summaries
Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
Rownames
human <- column_to_rownames(human0, "Country")
Summary, statistics
Let’s look at the data, starting with some descriptive statistics:
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Based on the difference between mean and median, and the ranges of the variables, Edu.Exp and Parli.F seem the most normally distributed. Edu2.FM, Labo.FM and Life.Exp are more skewed or have longer tails. GNI and Mat.Mor are closer to log-normal (or a related distribution), and maybe Ado.Birth as well.
Summary, graphical
corrplot(cor(human), method="circle",type = "upper", tl.pos = "d")
fig. 1.1, correlation matrix of variables
There are definitely strong correlations in the data. We can identify four different sets of variables from this matrix.
First, there is one set of very strongly correlated variables: Mat.Mor, Ado.Birth, Life.Exp, and Edu.Exp. These will likely be very strongly represented in the first principal component of a PCA.
Second, these four variables also correlate very strongly with GNI and Edu2.FM, more strongly than that pair of variables correlate with each other. Since they correlate less amongst themselves, there seem to be more degrees of freedom here.
Then there is a third group, Labo.FM and Parli.F. They are much less correlated with the other set of variables; however, they are most strongly correlated with each other. Again, seemingly more degrees of freedom in this group.
ggpairs(human, progress = FALSE)
fig. 1.2, distributions and scatterplots of variables
The correlations become much clearer when looking at the scatterplots. Some are clearly linear, like Edu.Exp and Life.Exp. It’s even clearer that GNI might need to be log-transformed, based on its distribution and the shapes of the scatter plots; possibly Mat.Mor and Ado.Birth too.
The other distributions look like skewed normal-like distributions, all of them have one big main mode, although there are some interesting concentrations in certain parts of the distributions.
E.g., in Life.Exp, there is a clear plateau from 60 to
roughly 65 or so. This may be related to quality of life before and
after retirement.
Another example is this curious smaller mode in the
Edu2.FM around 0.5, and a subsequent drop after that. This
suggests that there are two overlaid populations, one where the mode is
a little under 1, another where the mode is around 0.5. I don’t have a
hypothesis for what could be causing this difference in behaviors in the
two populations of countries, but it’s a very interesting feature of the
distribution. Unfortunately we’ve already thrown away the two variables
that this variable is based on, which may have given us some clues to
what this effect is.
Let’s do some transforms, just for fun
human.log <- human %>% mutate(GNI = log(GNI)) %>% mutate(Mat.Mor = log(Mat.Mor)) %>% mutate(Ado.Birth = log(Ado.Birth))
ggpairs(human.log, progress = FALSE)
fig. 1.3, distributions and scatterplots of variables after log transform
The scatter plots now form much clearer trend lines, which suggests this was a good idea. Let’s recheck the correlation matrix, just to see that the rescaling didn’t mix things up too much.
corrplot(cor(human.log), method="circle",type = "upper", tl.pos = "d")
fig. 1.4, correlation matrix using log transformed dataset
Well, this does seem to have changed a lot of things. The
correlations are now much more even among five variables:
Life.Exp, Edu.Exp, GNI,
Mat.Mor, and Ado.Birth. The correlation with
Edu2.FM is a little weaker, and the Parli.F
and Labo.FM variables are again least correlated with the
others.
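A tiny plain-Python sketch of why the log transform helps: for a multiplicative (exponential-like) relationship, the raw response has growing gaps over an evenly spaced predictor, while the logged response has constant gaps, i.e. it is linear (synthetic data, not the human set):

```python
import math

xs = [0.5 * i for i in range(1, 9)]  # evenly spaced predictor
ys = [math.exp(x) for x in xs]       # exponential response, like GNI

raw_steps = [b - a for a, b in zip(ys, ys[1:])]  # gaps grow: curvature
log_steps = [math.log(b) - math.log(a) for a, b in zip(ys, ys[1:])]

# After the log transform the gaps are constant, i.e. the relation is linear.
linear_after_log = max(log_steps) - min(log_steps) < 1e-9
```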
2. PCA with raw data
Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)
Since the log transform was not part of the assignment, let’s forget it for now, and do PCA on the raw dataset and look at the coefficients.
pca_raw <- prcomp(human)
pca_raw
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03 0.0022190154
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02 0.9865644425
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02 0.1431180282
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05 -0.0001135863
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03 0.0266373214
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03 0.0188618600
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
## PC6 PC7 PC8
## Edu2.FM 2.180183e-02 6.998623e-01 7.139410e-01
## Labo.FM 3.264423e-02 7.132267e-01 -7.001533e-01
## Life.Exp -1.453515e-01 5.380452e-03 2.281723e-03
## Edu.Exp 9.882477e-01 -3.826887e-02 7.776451e-03
## GNI -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -2.309896e-02 -2.642548e-03 2.680113e-03
PC1 is nearly all GNI, PC2 is nearly all Mat.Mor, PC3 is nearly all Ado.Birth, and so on. This is a problem. Let’s see why by looking at the summary of the transformed rows, which are in pca_raw$x.
summary(pca_raw$x)
## PC1 PC2 PC3 PC4
## Min. :-105495 Min. :-858.85 Min. :-87.306 Min. :-37.798
## 1st Qu.: -6885 1st Qu.: -75.30 1st Qu.:-11.498 1st Qu.: -6.643
## Median : 5587 Median : 60.76 Median : 2.681 Median : 1.787
## Mean : 0 Mean : 0.00 Mean : 0.000 Mean : 0.000
## 3rd Qu.: 13430 3rd Qu.: 117.75 3rd Qu.: 13.592 3rd Qu.: 8.200
## Max. : 17051 Max. : 200.85 Max. : 99.552 Max. : 26.267
## PC5 PC6 PC7 PC8
## Min. :-16.0452 Min. :-4.38647 Min. :-0.70772 Min. :-0.43774
## 1st Qu.: -2.0796 1st Qu.:-1.10824 1st Qu.:-0.08793 1st Qu.:-0.08919
## Median : 0.5794 Median : 0.07198 Median : 0.02775 Median :-0.01657
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 2.3562 3rd Qu.: 1.09716 3rd Qu.: 0.10801 3rd Qu.: 0.08364
## Max. : 7.6214 Max. : 3.80771 Max. : 0.80521 Max. : 0.50052
As we saw earlier, GNI per capita ranges from roughly 500 to over 100000, while the other variables top out in the hundreds at most. Since PCA maximizes the spread along each principal component, the components end up dominated by whichever variable has the largest scale, because the distance scales of these variables are not comparable (the dataset needs to be standardized).
If we look at a summary of the pca, this is even more clear:
summary(pca_raw)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
This tells us that PC1 captures nearly all of the variance (>0.9999) in the dataset, which again is due to the dataset not being standardized.
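The mechanism is simple: the total variance of the data is the sum of the per-variable variances, and PC1 captures at least as much variance as the single largest variable. A plain-Python sketch with made-up numbers on GNI-like and Edu2.FM-like scales:

```python
# Two variables on wildly different scales, like GNI vs. Edu2.FM
# (values are made up to match the orders of magnitude in the summary above).
big = [500.0, 20000.0, 40000.0, 90000.0, 120000.0]
small = [0.3, 0.9, 0.6, 1.2, 0.8]

def var(xs):
    """Sample variance (n - 1 denominator), as in R."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# PC1 explains at least the largest single-variable share of total variance,
# which is essentially all of it here.
share_of_big = var(big) / (var(big) + var(small))
```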
Knowing this, we can’t expect a very good biplot, but let’s plot one anyway.
pca_pr <- round(100*summary(pca_raw)$importance[2, ], digits = 4)
labels <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_raw, cex = c(0.8, 1), col = c("grey40", "#dd4028"), xlab = paste("Negative GNI,", labels[1]), ylab = paste("Maternal mortality,",labels[2]))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
fig. 2.1, biplot for PCA of the raw data
Again, we see that PC1 is pretty much only GNI. This is evident from the fact that the GNI arrow is the only visible one. This can also be inferred from looking at the order of the countries, high GNI countries are to the left, low GNI countries to the right.
I’ve named the principal components according to the variable that they’ve picked up from the dataset (negative GNI and Maternal Mortality), since each of the first few components are almost aligned with one dimension.
3. PCA with standardized variables
Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analyses (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)
Let’s standardize, do a PCA, and look at the principal components.
pca_scaled <- human %>% scale %>% as.data.frame %>% prcomp
pca_scaled
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 -0.2625067
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 -0.1628935
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 -0.1659678
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 0.1523699
## PC6 PC7 PC8
## Edu2.FM -0.17713316 0.05773644 0.16459453
## Labo.FM 0.03500707 -0.22729927 -0.07304568
## Life.Exp 0.42242796 -0.43406432 0.62737008
## Edu.Exp 0.38606919 0.77962966 -0.05415984
## GNI -0.11120305 -0.13711838 -0.16961173
## Mat.Mor -0.17370039 0.35380306 0.72193946
## Ado.Birth 0.76056557 -0.06897064 -0.14335186
## Parli.F -0.13749772 0.00568387 -0.02306476
This already looks much better. Let’s see the spreads.
summary(pca_scaled$x)
## PC1 PC2 PC3 PC4
## Min. :-3.4128 Min. :-2.79379 Min. :-2.13797 Min. :-3.43435
## 1st Qu.:-1.4651 1st Qu.:-0.61041 1st Qu.:-0.61625 1st Qu.:-0.51644
## Median :-0.4934 Median : 0.04166 Median :-0.07731 Median : 0.05376
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.0237 3rd Qu.: 0.68895 3rd Qu.: 0.47541 3rd Qu.: 0.54936
## Max. : 6.0717 Max. : 3.12552 Max. : 2.56743 Max. : 2.27574
## PC5 PC6 PC7 PC8
## Min. :-1.54550 Min. :-1.82764 Min. :-0.98931 Min. :-1.0576
## 1st Qu.:-0.41848 1st Qu.:-0.26156 1st Qu.:-0.29823 1st Qu.:-0.1581
## Median :-0.06288 Median :-0.02236 Median :-0.02144 Median : 0.0261
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.33133 3rd Qu.: 0.34286 3rd Qu.: 0.29026 3rd Qu.: 0.1657
## Max. : 2.74409 Max. : 1.49002 Max. : 1.13872 Max. : 1.3844
OK, the ranges of the dimensions seem much more sensible now: they are all within the same order of magnitude. Let’s see how much variance each principal component explains.
summary(pca_scaled)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
PC1 explains more than half the variance, not bad! However, all principal components do seem to capture at least one percent of the variance, so we will be losing information if we decide to cut any of them off.
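The cutoff reasoning can also be automated: pick the smallest number of components whose cumulative proportion of variance passes a chosen threshold. A minimal sketch, assuming the `pca_scaled` object from above; the 90% threshold is my own arbitrary choice.

```r
# Smallest number of PCs reaching 90% cumulative explained variance;
# assumes the pca_scaled prcomp object computed earlier.
prop_var <- summary(pca_scaled)$importance["Proportion of Variance", ]
n_pc <- which(cumsum(prop_var) >= 0.90)[1]
n_pc
```

Given the cumulative proportions printed above, this should land on PC5 (cumulative 92.5%).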
pca_pr <- round(100*summary(pca_scaled)$importance[2, ], digits = 4)
labels <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_scaled, cex = c(0.8, 1), col = c("grey40", "#dd4028"), xlab = paste("neg. Human development,", labels[1]), ylab = paste("Gender equality,",labels[2]))
fig. 3.1, biplot for PCA of the scaled data
Much better, and better yet, all original variables are roughly axis-aligned. I’ve added descriptive labels based on which variables align with which axes, more on these in section 4.
3.1 Interpretation
Raw data vs scaled data
As already discussed, the raw data was a bad fit for PCA due to the different orders of magnitude in the dispersion among the variables. PCA on raw data essentially just picked out one variable at a time in decreasing order of scale.
PCA on the scaled data performs much better.
PC1 is neg. Human development
I’ve decided to name PC1 neg. Human development,
inspired by the name of the original dataset. This principal component
measures the welfare of the country in terms of health (Mat.Mor,
Ado.Birth, Life.Exp), standard of living (GNI per cap., ppp adjusted),
and education (Edu.Exp, Edu2.FM). The value of this component is smaller
with better outcomes in these domains, which is where the
neg. comes in. I would rather take the negative of this PC1
and call it Human development, but this is what the PCA
gave us.
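That said, the sign of a principal component in prcomp is arbitrary, so nothing stops us from flipping PC1 ourselves. A minimal sketch, assuming the `pca_scaled` object from above; the `hd_*` names are mine.

```r
# Negating both the scores and the loadings of PC1 gives an equivalent
# decomposition where larger values mean higher human development.
hd_scores   <- -pca_scaled$x[, "PC1"]
hd_loadings <- -pca_scaled$rotation[, "PC1"]
head(sort(hd_scores, decreasing = TRUE))  # highest development first
```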
PC2 is gender equality
PC2 I’ve called gender equality, because it measures female participation in political decision-making (0.65 * Parli.F) and female participation in the labour market (0.72 * Labo.FM).
This is perhaps not the whole story, because this component has a positive contribution from maternal mortality rates and adolescent birth. Maybe this component only measures whether society has moved away from traditional gender roles. In that case, this can be seen as a cultural liberal/conservative axis.
PC3 is negative female political empowerment, or negative attitudes towards female leadership
PC3 is positively correlated with female participation in political decision-making (0.73 * Parli.F), but negatively correlated with the ratio of female participation in labor markets and secondary education (-0.584 * Labo.FM, -0.24 * Edu2.FM).
Let’s flip this around and consider NPC3 = -PC3; it’s
easier to reason about that way. If a country has a high level of female
MPs relative to the female education and participation in the labor
market, then NPC3 is high.
So this principal component measures how women are viewed in society. Are they seen as leaders (and elected into parliament)? Then PC3 is low. Are they seen as useful in the workforce but not as leaders? Then PC3 is high.
PC4 is difficult to interpret
The principal components are getting harder and harder to reason about as we go further down the list.
Roughly,
PC4 = 0.62 Edu2.FM - 0.72 GNI - 0.25 Mat.Mor.
These variables don’t seem to make a clear story. Edu2.FM should be very close to 1 for all developed nations, as high school dropouts are rare, and with negative GNI per cap. maybe this axis is about the economic development of the country? The maternal mortality rate is difficult to square with this.
This component might measure the relative focus on wealth as compared to other societal welfare in the country. With heavy emphasis on might.
4. Interpret biplot
Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)
The principal components
I already touched on this in the previous section, but as the first
principal components align with Mat.Mor,
Ado.Birth, Life.Exp, GNI,
Edu.Exp, and Edu2.FM, it seems to capture the
general wellbeing of society. Or rather, the lack thereof, since it’s
pointing in the other direction.
The second axis measures gender equality, as it aligns well with
Parli.F and Labo.FM.
Correlations
There are some additional interesting correlations in the original
variables with the PCs. Mat.Mor and Edu.Exp
both point slightly in the positive PC2 direction. This doesn’t make a
lot of intuitive sense if we consider PC2 to capture gender equality or
liberal values. However, Mat.Mor is pointing in the
opposite direction from the social equality trend line, which may
suggest that the principal components aren’t aligned in any
particularly meaningful direction. PCA has maximized the dispersion
along PC1 and PC2, but it does not necessarily mean that the axes are
meaningful in themselves, the axes may be slightly misaligned compared
to the labels I’ve given them.
This idea is supported by the fact that PC1 is bimodal, so the second, smaller mode might then be pulling the PCA slightly off the trend line. The other mode consists mostly of African countries and other developing nations. The fact that the other variables are distributed differently here might be due to artifacts of colonialism or some other big systemic differences between developed and developing nations.
ggplot(data.frame(pca_scaled$x), aes(x=PC1)) + geom_density()
fig. 3.2, PC1 distribution, clearly bimodal
Parli.F and Labo.FM are pointing in
slightly different directions from each other, with
Parli.F slightly aligned with GNI, and
Labo.FM completely orthogonal to it. This suggests that
female labor participation has no effect on GNI (it doesn’t matter for
the economy what the sex of a worker is), but female leadership has a
positive correlation with GNI (of course, we don’t have a causal link
here, only a correlation).
EXTRA: redoing the PCA with the log-transformed data
Just for fun, let’s use the log transformed data to see if some of those correlations change
pca_log_scaled <- human.log %>% scale %>% as.data.frame %>% prcomp
summary(pca_log_scaled)
biplot(pca_log_scaled, cex = c(0.8, 1), col = c("grey40", "#dd4028"))
fig. 4.1, biplot for PCA of the standardized and log transformed data
Sure enough, if we look at the summary, we see that the first PC now captures 57% of the variance compared to 53.6% before, a better result. The first four PCs now capture 90% of the variance, after that we get diminishing returns.
pca_log_scaled
## Standard deviations (1, .., p=8):
## [1] 2.1505310 1.1259172 0.8681023 0.7727862 0.5558253 0.4410956 0.3827107
## [8] 0.3267292
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.32850815 0.01067147 -0.35894548 0.77560185 -0.36060516
## Labo.FM 0.03263713 0.72942195 -0.61074433 -0.23458714 0.05083415
## Life.Exp -0.42542035 -0.05622607 0.11141581 -0.03806539 0.26227538
## Edu.Exp -0.42035073 0.09697456 -0.06818573 -0.03839701 0.44933876
## GNI -0.43314660 -0.09480875 -0.01735880 0.06836804 0.13860792
## Mat.Mor 0.43672168 0.01162213 -0.02391875 0.21779671 -0.12263401
## Ado.Birth 0.38284906 0.03041796 -0.03340255 0.48515646 0.74256569
## Parli.F -0.09178670 0.66724454 0.69216873 0.23021937 -0.10502880
## PC6 PC7 PC8
## Edu2.FM 0.05763161 0.12121944 0.11626173
## Labo.FM 0.12840535 -0.11338286 -0.08313215
## Life.Exp 0.73375796 0.22248569 -0.38118853
## Edu.Exp -0.56661984 0.53272731 -0.03187199
## GNI -0.27117249 -0.74980849 -0.37876162
## Mat.Mor -0.17790440 0.22051498 -0.81597528
## Ado.Birth 0.11877107 -0.16353812 0.15412247
## Parli.F -0.03795809 -0.03959254 0.01489879
Looking closer at the coefficients, PC1, PC2, and PC3 have roughly the same interpretation as before. PC4 has changed now, but it is still as difficult to interpret as before.
5. MCA on tea data
The tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and how they perceive the product (12 questions). In addition, some personal details were asked (4 questions).
Load the tea dataset and convert its character variables to factors:
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data. Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
dim(tea)
## [1] 300 36
#View(tea)
The dataset consists of 300 rows of 36 answers each; each row represents one respondent’s questionnaire.
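A compact way to confirm that stringsAsFactors did its job is str(), which lists each column’s type and factor levels; a quick sketch on the `tea` data frame loaded above.

```r
# str() confirms the character columns were read as factors and shows
# the first few levels of each; list.len limits the printout length.
str(tea, list.len = 6)
```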
Summary statistics follow.
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 +60 :38 +2/day :127
## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
## non-worker :64 25-34:69 1/day : 95
## other worker:20 35-44:40 3 to 6/week: 34
## senior :35 45-59:61
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
The FactoMineR documentation (https://rdrr.io/cran/FactoMineR/man/tea.html) says the first 18 variables are about how the respondents drink tea, variable 19 is age, and the rest are personal questions and the “product’s perception”, which I’m choosing to interpret as how they think about tea.
5.1 MCA
There are a lot of variables here, so let’s choose a sensible subset of them to look at. Let’s choose the following variables, which I’m interpreting free-hand here, since I can’t find much metadata:
- Tea: what kind of tea out of three types the respondent prefers (green, black, Earl Grey)
- price: the price level of the tea that the respondent prefers
- How: whether the respondent takes tea as is, with lemon, with milk, or in some other way
- sex: the sex of the respondent
- SPC: some kind of general social group for the respondent (student, employee (white collar?), middle (management?), senior)
- age_Q: the age group of the respondent
- frequency: how often the respondent drinks tea
Of these, Tea, price, and How
are “active” variables, and the rest are “supplementary” variables.
As a sidenote, this dataset is very badly documented.
tea_filtered <- tea %>% dplyr::select(one_of("Tea", "price", "How", "sex", "SPC", "age_Q", "frequency"))
summary(tea_filtered)
## Tea price How sex SPC
## black : 74 p_branded : 95 alone:195 F:178 employee :59
## Earl Grey:193 p_cheap : 7 lemon: 33 M:122 middle :40
## green : 33 p_private label: 21 milk : 63 non-worker :64
## p_unknown : 12 other: 9 other worker:20
## p_upscale : 53 senior :35
## p_variable :112 student :70
## workman :12
## age_Q frequency
## +60 :38 +2/day :127
## 15-24:92 1 to 2/week: 44
## 25-34:69 1/day : 95
## 35-44:40 3 to 6/week: 34
## 45-59:61
##
##
Let’s look at a summary of the MCA of the dataset based on these specs above.
mca <- MCA(tea_filtered, quali.sup = 4:7, graph = F)
summary(mca)
##
## Call:
## MCA(X = tea_filtered, quali.sup = 4:7, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.423 0.399 0.383 0.355 0.335 0.330 0.316
## % of var. 12.688 11.967 11.497 10.653 10.043 9.907 9.484
## Cumulative % of var. 12.688 24.655 36.152 46.805 56.848 66.755 76.239
## Dim.8 Dim.9 Dim.10
## Variance 0.283 0.271 0.238
## % of var. 8.490 8.142 7.129
## Cumulative % of var. 84.729 92.871 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.720 0.409 0.056 | 1.341 1.503 0.196 | 0.789 0.541
## 2 | 0.782 0.481 0.216 | 0.593 0.294 0.124 | 0.071 0.004
## 3 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 4 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 5 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 6 | -0.357 0.100 0.027 | -0.078 0.005 0.001 | 0.344 0.103
## 7 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 8 | 0.782 0.481 0.216 | 0.593 0.294 0.124 | 0.071 0.004
## 9 | 0.497 0.195 0.083 | 0.222 0.041 0.016 | 0.544 0.257
## 10 | 1.104 0.961 0.443 | -0.678 0.384 0.167 | 0.138 0.016
## cos2
## 1 0.068 |
## 2 0.002 |
## 3 0.542 |
## 4 0.542 |
## 5 0.542 |
## 6 0.025 |
## 7 0.542 |
## 8 0.002 |
## 9 0.099 |
## 10 0.007 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 1.366 36.278 0.611 13.516 | -0.083 0.143 0.002
## Earl Grey | -0.440 9.794 0.348 -10.207 | 0.312 5.225 0.175
## green | -0.493 2.106 0.030 -2.996 | -1.636 24.612 0.331
## p_branded | -0.497 6.176 0.115 -5.856 | -0.218 1.257 0.022
## p_cheap | 1.116 2.289 0.030 2.982 | 1.313 3.361 0.041
## p_private label | 0.018 0.002 0.000 0.084 | -0.132 0.102 0.001
## p_unknown | 0.314 0.310 0.004 1.108 | 2.953 29.139 0.363
## p_upscale | 1.063 15.724 0.242 8.512 | -0.874 11.272 0.164
## p_variable | -0.188 1.036 0.021 -2.504 | 0.225 1.575 0.030
## alone | -0.275 3.870 0.140 -6.477 | -0.328 5.831 0.199
## v.test Dim.3 ctr cos2 v.test
## black -0.825 | 0.099 0.209 0.003 0.977 |
## Earl Grey 7.240 | -0.247 3.420 0.110 -5.741 |
## green -9.947 | 1.224 14.341 0.185 7.443 |
## p_branded -2.566 | 0.459 5.803 0.098 5.403 |
## p_cheap 3.509 | 0.315 0.201 0.002 0.841 |
## p_private label -0.626 | 1.039 6.569 0.081 4.928 |
## p_unknown 10.421 | 1.519 8.030 0.096 5.362 |
## p_upscale -6.999 | 0.310 1.476 0.021 2.483 |
## p_variable 2.999 | -0.913 27.080 0.497 -12.188 |
## alone -7.721 | -0.153 1.330 0.044 -3.614 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.611 0.359 0.207 |
## price | 0.324 0.559 0.565 |
## How | 0.334 0.279 0.378 |
##
## Supplementary categories (the 10 first)
## Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3 cos2
## F | -0.099 0.014 -2.071 | -0.001 0.000 -0.018 | -0.092 0.012
## M | 0.145 0.014 2.071 | 0.001 0.000 0.018 | 0.135 0.012
## employee | -0.153 0.006 -1.307 | -0.104 0.003 -0.891 | 0.027 0.000
## middle | 0.179 0.005 1.213 | 0.095 0.001 0.645 | 0.033 0.000
## non-worker | 0.281 0.021 2.531 | -0.214 0.012 -1.927 | -0.010 0.000
## other worker | 0.221 0.003 1.021 | 0.174 0.002 0.804 | 0.108 0.001
## senior | 0.226 0.007 1.419 | 0.086 0.001 0.543 | 0.009 0.000
## student | -0.350 0.037 -3.343 | 0.108 0.004 1.027 | -0.090 0.002
## workman | -0.327 0.004 -1.154 | 0.167 0.001 0.588 | 0.133 0.001
## +60 | 0.702 0.072 4.624 | -0.316 0.014 -2.078 | -0.033 0.000
## v.test
## F -1.930 |
## M 1.930 |
## employee 0.229 |
## middle 0.225 |
## non-worker -0.091 |
## other worker 0.498 |
## senior 0.056 |
## student -0.861 |
## workman 0.470 |
## +60 -0.218 |
##
## Supplementary categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## sex | 0.014 0.000 0.012 |
## SPC | 0.068 0.020 0.004 |
## age_Q | 0.130 0.022 0.014 |
## frequency | 0.011 0.007 0.026 |
5.2 Plotting and interpretation
Let’s start with a plot of the individuals (ind):
par(mfrow=c(2,2))
plot(mca, graph.type="ggplot", invisible=c("var","quali.sup","quanti.sup"),cex=0.8)
I see no real clusters, but there seem to be some duplicates. Dimension 1 covers 12.7% of the variance and dimension 2 covers 12% of the variance, roughly the same. The scatter plot forms a triangle in this space, with a wide base around Dim 1 = 0 and extending to the right.
Now let’s plot the var variables — Tea,
price, and How — which the plot function for
FactoMineR’s MCA colors black, red, and green, respectively.
plot(mca, graph.type="ggplot", invisible=c("ind","quali.sup","quanti.sup"), cex=0.8, habillage="quali")
Dimension 1 seems to cover:
- most strongly (Tea: 0.61), whether the respondent prefers black tea or not (black +1.36, green -0.49, Earl Grey -0.44)
- then (How: 0.334), whether the respondent likes to add things to their tea (alone -0.28; lemon, milk, and other all positive)
- and finally, no clear correlation with price, since both upscale and cheap are positive in dimension 1
Dimension 2 seems to cover:
- whether the respondent is unlikely to prefer green tea (-1.636)
- whether the respondent likes cheap tea (cheap +1.313, upscale -0.874)
The p_unknown category in the price has the highest
coefficient here, which makes this one tough to interpret.
Finally, let’s look at the demographics of the individuals and see how they distribute over this space:
plot(mca, graph.type="ggplot", invisible=c("ind","var","quanti.sup"), cex=0.8, habillage="quali")
It seems that age is the best explaining variable for the differences in respondents’ tea-drinking habits (dim 1: 0.130, dim 2: 0.022), followed by SPC, which encodes some of the same information as age, so perhaps it is redundant. If we interpret the dimensions from before, we could roughly say that:
- older people are more likely to prefer generic black tea to green tea or Earl Grey
- younger people are more likely to prefer Earl Grey
- green tea is most likely to be preferred by people in the 25-34 age range who are “employees” (office workers?)
- older people prefer to add milk, lemon, or something else to their tea
There is also a slight difference in the distribution of answers between sexes:
- men are more likely to prefer black tea
- women are more likely to prefer Earl Grey or green tea
The frequency of tea drinking is all over the place and forms no clear trendline in the biplot.
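The visual reading above can be cross-checked numerically with FactoMineR’s dimdesc(), which ranks the variables and categories most associated with each dimension. A sketch assuming the `mca` object fitted earlier; I believe the output is a per-dimension list, but the exact components may differ by FactoMineR version.

```r
# dimdesc() describes each MCA dimension by the variables and
# categories most significantly associated with it.
desc <- dimdesc(mca, axes = 1:2)
desc$`Dim 1`$category  # categories driving dimension 1
```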
Assignment 6: Analyzing longitudinal data
library(readr)
library(dplyr)
library(GGally)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(lme4)
## Loading required package: Matrix
1. Analyzing the rats dataset using the summary measure
approach
In this problem, we’re using the summary measure approach on the
rats dataset.
The rats data is from a nutrition study presented in the
textbook Analysis of Repeated Measures (Crowder & Hand,
1990) on three groups of rats, each group given a different diet. The
rats were weighed weekly over a 9-week period to study the effects of
the different diets.
rats <- read_csv("data/rats.csv") %>% mutate(ID = factor(ID)) %>% mutate(Group = factor(Group))
## Rows: 176 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): ID, Group, Bodyweight, Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
1.1A Dataset visualization, weight
Let’s inspect:
- the distribution of initial bodyweights
- the distribution of final bodyweights
- the distribution of bodyweights in the three groups
- the mean growth profile
- the growth profile of the three groups
Below, there are four tabs in the notebook, you can switch between them to look at different aspects of the data.
Distribution shift
We start by looking at the distribution of initial and final bodyweights to see if we can detect a trend.
initial_time <- min(rats$Time)
final_time <- max(rats$Time)
initial_rats <- rats %>% filter(Time == initial_time)
final_rats <- rats %>% filter(Time == final_time)
min_range <- min(rats$Bodyweight)
max_range <- max(rats$Bodyweight)
ggplot() + geom_density(data= initial_rats, bw = "SJ", aes(x=Bodyweight, fill = "Baseline bodyweight", alpha=0.4)) +
geom_density(data= final_rats, bw = "SJ", aes(x=Bodyweight, fill = "Final bodyweight", alpha=0.4)) +
scale_fill_manual(values=c("Baseline bodyweight" = "#cccccc","Final bodyweight" = "#ff9955")) +
labs(fill= "Weight distribution") +
scale_x_continuous(limits = c(min_range, max_range)) +
scale_alpha(guide = "none")
fig. 1.1, Baseline and final rat weight distributions
There is a clear shift towards a heavier distribution of bodyweights over the course of the experiment — the rats are growing. It looks like there are three modes in both distributions, but in the final bodyweight distribution, the middle mode has moved close to the heaviest mode.
Groups
Let’s look at the initial bodyweights of the three groups.
initial_rats %>%
ggplot(aes(x=Bodyweight, group=Group, fill=Group, alpha = 0.5)) +
geom_density(bw = "SJ")+
scale_alpha(guide = "none")
fig. 1.2, Baseline weight distributions per group
Hmm, the groups seem to have different bodyweight distributions: the modes do not overlap, and I have a feeling that the difference between groups is much greater than the growth from initial to final measurements. This is even clearer in the next tab to the right.
Distribution shift per group
ggplot() + geom_density(data= initial_rats, bw = "SJ", aes(x=Bodyweight, group=Group, fill = paste("Group",Group,"baseline"), alpha=0.4)) +
geom_density(data= final_rats, bw = "SJ", aes(x=Bodyweight, group=Group, fill = paste("Group",Group,"final"), alpha=0.4)) +
scale_fill_manual(values=c("Group 1 baseline" = "#ff9999",
"Group 2 baseline" = "#99ff99",
"Group 3 baseline" = "#9999ff",
"Group 1 final" = "#dd2222",
"Group 2 final" = "#22dd22",
"Group 3 final" = "#2222dd")) +
labs(fill= "Weight distribution") +
scale_x_continuous(limits = c(min_range, max_range)) +
scale_alpha(guide = "none")
fig. 1.3, Baseline and final weight distributions per group
The shifts of the distributions are visible, but they are still smaller than the initial differences between the groups’ weight distributions. In my mind this makes the groups a little difficult to compare, since we can’t be sure that the effect of the baseline bodyweight doesn’t behave differently for each group. The high green peak is a single outlier, as we will see when looking at the line plots.
Growth trend
rat_trend <- rats %>% group_by(Time) %>%
summarise(mean = mean(Bodyweight), se = sd(Bodyweight)/sqrt(n())) %>%
ungroup()
rat_trend %>% ggplot(aes(x=Time, y = mean)) +
geom_line(aes(color="mean +- se")) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, color = "mean +- se")) +
labs(color= "Group") +
ylab("Bodyweight") +
geom_line(data=rats, aes(x=Time, y=Bodyweight, color = Group, group=ID))
fig. 1.4, Mean growth and plot of each rat’s bodyweight increase
We see an issue with taking the mean bodyweight of all rats: none of the rats are at the mean; they straddle it on both sides. We also see more clearly in this plot that one of the groups (group 1) has double the number of rats the others do (8 vs. 4). The outlier in the green group is also much more apparent here; we might have to remove it.
It’s clear that there are issues with using the bodyweight itself. If we were to use a linear mixed effects model, we might be able to deal with this. But for a summary measure approach, we might want to capture the growth in some other way, which leads us to the next section.
1.1B Dataset visualization, weight gain
The reason the lines are difficult to compare is that the groups’
baselines (intercepts) are so different. If we could get all of the
lines to start at the same location, we may be able to compare them
better, so let’s create a new column for each observation subtracting
the baseline bodyweight from the measured bodyweight, which we can call
Weight_gain.
rats <- rats %>%
mutate(Weight_gain = Bodyweight - filter(initial_rats, ID == ID)$Bodyweight) %>%
mutate(Baseline_weight = initial_rats[as.integer(ID),]$Bodyweight)
checksum <- rats$Baseline_weight - rats$Bodyweight + rats$Weight_gain
all(checksum == 0)
## [1] TRUE
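The mutate above leans on the rows of `rats` lining up with the baseline table; a more defensive variant joins each observation to its own ID’s baseline explicitly. A sketch assuming the `rats` and `initial_rats` objects from above; the `*_chk` names are mine.

```r
# Order-independent alternative: join each row to its ID's baseline
# measurement, then recompute the gain and compare to the column above.
baseline <- initial_rats %>% select(ID, Baseline_chk = Bodyweight)
rats_chk <- rats %>%
  left_join(baseline, by = "ID") %>%
  mutate(Gain_chk = Bodyweight - Baseline_chk)
all(rats_chk$Gain_chk == rats_chk$Weight_gain)  # TRUE if the row-order assumption held
```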
Again, there are multiple tabs to browse through.
Weight gain after 9W
The initial distribution of weight gain is constant (all zeros) so we will only plot the overall weight gain distribution after 9 weeks.
final_rats <- rats %>% filter(Time == final_time)
ggplot() + geom_density(data=final_rats, bw = "SJ", aes(x=Weight_gain, fill = "0")) +
scale_fill_manual(values=c("0"="#ff9955"), guide = "none") +
xlab("Weight gain after 9 weeks")
fig. 2.1, Weight gain distribution after 9 weeks
There seem to be two modes here, which may indicate a difference between groups.
Weight gain per group
Let’s look at the final bodyweights of the three groups.
final_rats %>%
ggplot(aes(x=Weight_gain, group=Group, fill=Group, alpha = 0.5)) +
geom_density(bw = "SJ")+
scale_alpha(guide = "none")+
xlab("Weight gain after 9 weeks")
fig. 2.2, Weight gain distribution per group after 9 weeks
Okay, there seems to be some difference between the red (1) and green (3) groups, but these were also the groups with the largest differences in the baseline.
Weight gain trend
rat_trend <- rats %>% group_by(Time) %>%
summarise(mean = mean(Bodyweight), mean_gain = mean(Weight_gain), se = sd(Bodyweight)/sqrt(n()),se_gain = sd(Weight_gain)/sqrt(n())) %>%
ungroup()
rat_trend %>% ggplot(aes(x=Time, y = mean_gain,ymin=mean_gain-se_gain, ymax=mean_gain+se_gain)) +
geom_line(aes(color="mean +- se")) +
geom_errorbar(aes(color = "mean +- se")) +
geom_ribbon(aes(fill="0",alpha=0.1)) +
labs(color= "Group") +
ylab("Weight gain") +
scale_fill_manual(guide="none", values =(c("0"="#cccccc"))) +
scale_alpha(guide="none") +
geom_line(data=rats, aes(x=Time, y=Weight_gain,ymin =0, ymax=0, color = Group, group=ID))
## Warning in geom_line(data = rats, aes(x = Time, y = Weight_gain, ymin = 0, :
## Ignoring unknown aesthetics: ymin and ymax
fig. 2.3, Mean growth and plot of each rat’s bodyweight increase
Looking at the weight gain of each rat, we do start to see some separation between the groups.
1.2 Standardize
Let’s standardize the dataset and plot that last plot
(Growth Trend) again, for both the bodyweight and the
weight gain.
rats_std <- rats %>%
group_by(Time) %>%
mutate(std_weight = scale(Bodyweight)) %>%
mutate(std_weight_gain = if_else(Weight_gain != 0, scale(Weight_gain), 0)) %>%
ungroup()
std_rat_trend <- rats_std %>% group_by(Time) %>%
summarise(mean = mean(std_weight), mean_gain = mean(std_weight_gain), se = sd(std_weight)/sqrt(n()),se_gain = sd(std_weight_gain)/sqrt(n())) %>%
ungroup()
weight_plot <- std_rat_trend %>% ggplot(title = "", aes(x=Time, y = mean,ymin=mean-se, ymax=mean+se)) +
geom_line(aes(color="mean +- se", linetype="2")) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, color = "mean +- se"), width=1.5, size = 0.3) +
geom_ribbon(aes(fill="0",alpha=0.1)) +
ylab("std. weight") +
geom_line(data=rats_std, aes(x=Time, y=std_weight, ymin=0, ymax=0, color = Group, group=ID, linetype="1")) +
scale_fill_manual(guide="none", values =(c("0"="#cccccc"))) +
theme(legend.position="none")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in geom_line(data = rats_std, aes(x = Time, y = std_weight, ymin = 0, :
## Ignoring unknown aesthetics: ymin and ymax
gain_plot <- std_rat_trend %>% ggplot(aes(x=Time, y = mean_gain, ymin=mean_gain-se_gain, ymax=mean_gain+se_gain)) +
geom_line(aes(color="mean +- se", linetype="2")) +
geom_errorbar(aes( color = "mean +- se"), width=1.5, size = 0.3) +
geom_ribbon(aes(fill="0",alpha=0.1)) +
labs(color= "Group") +
ylab("std. weight gain") +
geom_line(data=rats_std, aes(x=Time, y=std_weight_gain, ymin=0, ymax=0, color = Group, group=ID, linetype="1")) +
scale_linetype(guide="none") +
scale_fill_manual(guide="none", values =(c("0"="#cccccc"))) +
scale_alpha(guide="none") +
theme(legend.position = c(0.92,0.105))
## Warning in geom_line(data = rats_std, aes(x = Time, y = std_weight_gain, :
## Ignoring unknown aesthetics: ymin and ymax
gridExtra::grid.arrange(weight_plot,gain_plot, ncol=2)
fig. 3.1, Standardized weight and weight gain per rat, and all-groups mean
The standardized bodyweight plot is difficult to interpret because of the difference in baselines, but we can identify a potential outlier in the green group.
The standardized weight gain plot does show some patterns, we see some spread among the groups, which we can dive deeper into.
1.3 Group mean and std. error plots
Now let’s do that same plot, but instead of individuals, let’s plot the mean and standard error per group.
std_rat_group_trends <- rats_std %>% group_by(Time, Group) %>%
summarise(mean = mean(std_weight), mean_gain = mean(std_weight_gain), se = sd(std_weight)/sqrt(n()),se_gain = sd(std_weight_gain)/sqrt(n())) %>%
ungroup()
## `summarise()` has grouped output by 'Time'. You can override using the
## `.groups` argument.
sum_weight_plot <- std_rat_group_trends %>% ggplot(aes(x = Time, y = mean,ymin=mean-se, ymax=mean+se, color = Group)) +
ggtitle("Bodyweight per group") +
geom_line() +
geom_point(size=3) +
geom_linerange() +
geom_ribbon(aes(fill=Group,alpha = 0.1)) +
scale_y_continuous(name = "mean(std. weight) +/- se(std. weight)") +
theme(legend.position="none")
sum_gain_plot <- std_rat_group_trends %>% ggplot(aes(x = Time,ymin=mean_gain-se_gain, ymax=mean_gain+se_gain, y = mean_gain, color = Group)) +
ggtitle("Weight gain per group") +
geom_line() +
geom_point(size=3) +
geom_linerange() +
geom_ribbon(aes(fill=Group,alpha = 0.1)) +
scale_y_continuous(name = "mean(std. weight gain) +/- se(std. weight gain)") +
scale_alpha(guide="none") +
theme(legend.position = c(0.95,0.08))
gridExtra::grid.arrange(sum_weight_plot,sum_gain_plot, ncol=2)
fig. 3.2, Bodyweight and weight gain over time per group
Looking at the weight gain plot specifically, we do see the three groups clearly separating, but we still don’t know how much of this effect would be because of the baseline weight of the rats or the treatment, since both of these variables differed between the groups.
1.4 Outliers
weight_boxplot <- final_rats %>% ggplot(aes(y=Bodyweight)) +
geom_boxplot() +
facet_wrap(~Group)
gain_boxplot <- final_rats %>% ggplot(aes(y=Weight_gain)) +
geom_boxplot() +
facet_wrap(~Group)
gridExtra::grid.arrange(weight_boxplot, gain_boxplot, ncol=2)
fig. 4.1, boxplots of the three groups
There is a rat in group 2 that seems to be inflating the variance of the group, which would make t-tests more difficult to run. Let’s remove it.
outlier_ids <- (final_rats %>% filter(Bodyweight > 600))$ID
rats <- rats %>% filter(!(ID %in% outlier_ids))
initial_rats <- rats %>% filter(Time == initial_time)
final_rats <- rats %>% filter(Time == final_time)
weight_boxplot2 <- final_rats %>% ggplot(aes(y=Bodyweight)) +
geom_boxplot() +
facet_wrap(~Group)
gain_boxplot2 <- final_rats %>% ggplot(aes(y=Weight_gain)) +
geom_boxplot() +
facet_wrap(~Group)
gridExtra::grid.arrange(weight_boxplot2, gain_boxplot2, ncol=2)
fig. 4.2, boxplots of the three groups, outliers removed
There still seems to be one outlier, but the variances of the groups are now more comparable.
One remaining issue is that the variance of group 1 is much smaller than that of groups 2 and 3. Group 1 also has twice as many rats, so its variance estimate is at least more stable than the others.
1.5 Summary measure: final weight gain
Visually we have identified a between-groups effect, but there are two possible causes of the effect: the baseline weight and the treatment (diet).
Based on table 8.2 from MABS4IODS, I will use the final value of the weight gain as the summary measure, because there is a growth trend in the data. Another option would have been the regression coefficient of body weight over time for each individual rat.
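For illustration, the per-animal regression coefficient mentioned above could be computed like this. A sketch on simulated data shaped like the rats set; the column names (ID, Time, Bodyweight) are assumed from the document, not verified against the actual file:

```r
# Alternative summary measure: one growth slope per rat, from an
# individual regression of body weight on time. Simulated stand-in data.
set.seed(2)
rats_sim <- data.frame(
  ID   = factor(rep(1:4, each = 5)),
  Time = rep(c(1, 8, 15, 22, 29), times = 4)
)
rats_sim$Bodyweight <- 250 + 10 * as.numeric(rats_sim$ID) +
  2 * rats_sim$Time + rnorm(20, 0, 3)

# one fitted slope per rat, usable as a summary measure
slopes <- sapply(split(rats_sim, rats_sim$ID),
                 function(d) coef(lm(Bodyweight ~ Time, data = d))[["Time"]])
slopes
```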
1.5.1 Unpaired t-test
As the baselines differ between groups, an unpaired t-test can only compare the weight gain of two groups at a time. The tabs contain the t-tests for each pairwise comparison.
final_rats_12 <- final_rats %>% filter(Group != 3)
final_rats_13 <- final_rats %>% filter(Group != 2)
final_rats_23 <- final_rats %>% filter(Group != 1)
Let’s also check if the variances of the groups are similar:
final_rats %>% group_by(Group) %>%
summarise(var = var(Weight_gain))
## # A tibble: 3 × 2
## Group var
## <fct> <dbl>
## 1 1 86.1
## 2 2 1051
## 3 3 326.
They are not, so I will use Welch's t-test (var.equal = FALSE) when doing the t-tests.
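The eyeball comparison can also be backed by a formal homogeneity-of-variance test. A sketch using base R's bartlett.test on simulated stand-in groups (not the actual final_rats data):

```r
# Formal check of variance homogeneity with Bartlett's test (base R).
# Simulated groups with clearly unequal spread; a small p-value supports
# using Welch's t-test rather than the pooled-variance version.
set.seed(3)
g <- factor(rep(1:3, each = 20))
x <- c(rnorm(20, 23, 3), rnorm(20, 62, 32), rnorm(20, 41, 18))
bartlett.test(x ~ g)$p.value  # very small: variances differ
```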
Group 1 and 2
t.test(Weight_gain ~ Group, data = final_rats_12, var.equal = F)
##
## Welch Two Sample t-test
##
## data: Weight_gain by Group
## t = -2.0458, df = 2.1242, p-value = 0.1699
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -116.22098 38.47098
## sample estimates:
## mean in group 1 mean in group 2
## 23.125 62.000
Group 1 and 3
t.test(Weight_gain ~ Group, data = final_rats_13, var.equal = F)
##
## Welch Two Sample t-test
##
## data: Weight_gain by Group
## t = -1.9138, df = 3.8172, p-value = 0.1316
## alternative hypothesis: true difference in means between group 1 and group 3 is not equal to 0
## 95 percent confidence interval:
## -45.543111 8.793111
## sample estimates:
## mean in group 1 mean in group 3
## 23.125 41.500
Group 2 and 3
t.test(Weight_gain ~ Group, data = final_rats_23, var.equal = F)
##
## Welch Two Sample t-test
##
## data: Weight_gain by Group
## t = 0.98659, df = 2.932, p-value = 0.3981
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
## -46.50239 87.50239
## sample estimates:
## mean in group 2 mean in group 3
## 62.0 41.5
There seems to be some difference between groups 1 and 2 and between groups 1 and 3, but neither t-test is significant at level 0.05 (both confidence intervals straddle 0). Based on this, we can only say that something may be causing the rats in group 1 to gain less weight than the other groups (and we still can't say whether it's their diet).
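One caveat worth noting: with three pairwise tests, the unadjusted p-values are optimistic. Base R's pairwise.t.test can apply a multiplicity correction in one call; a sketch on simulated stand-in data (not the actual Weight_gain values):

```r
# Multiplicity-adjusted pairwise comparisons (base R). pool.sd = FALSE
# gives Welch-style unequal-variance tests; "holm" adjusts the p-values
# for the three comparisons. Simulated stand-in data.
set.seed(4)
gain  <- c(rnorm(8, 23, 9), rnorm(4, 62, 32), rnorm(4, 41, 18))
group <- factor(rep(1:3, times = c(8, 4, 4)))
pairwise.t.test(gain, group, p.adjust.method = "holm", pool.sd = FALSE)
```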
1.5.2 ANOVA
Interestingly, the test statistic for Group is the same regardless of whether we use the body weight or the weight gain as the response variable, so let's use the weight gain.
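This equality can be checked directly: weight gain is body weight minus the baseline, and since the baseline is already a covariate in the model, subtracting it from the response leaves the residuals, and hence the sequential F for Group, unchanged. A minimal sketch on simulated data (variable names are illustrative, not the document's):

```r
# Why the Group F-statistic is identical for weight vs. weight gain when
# baseline weight is a covariate: subtracting a regressor already in the
# model from the response does not change the residual structure.
set.seed(1)
group <- factor(rep(1:3, each = 10))
baseline <- rnorm(30, 300, 30)
final <- baseline + 40 + 10 * as.numeric(group) + rnorm(30, 0, 5)
gain <- final - baseline

f_weight <- anova(lm(final ~ baseline + group))["group", "F value"]
f_gain   <- anova(lm(gain  ~ baseline + group))["group", "F value"]
isTRUE(all.equal(f_weight, f_gain))  # TRUE
```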
For each of the four combinations {1,2,3}, {1,2}, {1,3}, {2,3}, we fit a linear model and then we perform an analysis of variance (ANOVA) on the fit.
Dependent variable:
- weight gain
Independent variables:
- the baseline weight of the rat
- the diet (group)
All three groups
final_fit_gain <- lm(Weight_gain ~ Baseline_weight + Group, data = final_rats)
anova(final_fit_gain)
## Analysis of Variance Table
##
## Response: Weight_gain
## Df Sum Sq Mean Sq F value Pr(>F)
## Baseline_weight 1 1308.3 1308.32 8.1039 0.015889 *
## Group 2 4072.2 2036.11 12.6120 0.001423 **
## Residuals 11 1775.9 161.44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Group 1 and 2
final_fit_gain_12 <- lm(Weight_gain ~ Baseline_weight + Group, data = final_rats_12)
anova(final_fit_gain_12)
## Analysis of Variance Table
##
## Response: Weight_gain
## Df Sum Sq Mean Sq F value Pr(>F)
## Baseline_weight 1 2369.3 2369.30 15.279 0.004489 **
## Group 1 2392.3 2392.32 15.427 0.004370 **
## Residuals 8 1240.6 155.07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Group 1 and 3
final_fit_gain_13 <- lm(Weight_gain ~ Baseline_weight + Group, data = final_rats_13)
anova(final_fit_gain_13)
## Analysis of Variance Table
##
## Response: Weight_gain
## Df Sum Sq Mean Sq F value Pr(>F)
## Baseline_weight 1 662.01 662.01 6.9201 0.02733 *
## Group 1 957.27 957.27 10.0066 0.01149 *
## Residuals 9 860.98 95.66
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Group 2 and 3
final_fit_gain_23 <- lm(Weight_gain ~ Baseline_weight + Group, data = final_rats_23)
anova(final_fit_gain_23)
## Analysis of Variance Table
##
## Response: Weight_gain
## Df Sum Sq Mean Sq F value Pr(>F)
## Baseline_weight 1 1870.8 1870.80 6.2693 0.06649 .
## Group 1 735.0 735.00 2.4631 0.19163
## Residuals 4 1193.6 298.41
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we can say that there seems to be a significant effect due to diet (p = 0.001423), and that the effect is between group 1 and the other groups (1 vs 2: p = 0.004370, 1 vs 3: p = 0.01149).
1.6 Interpretation
Based on the summary-measure analysis, we can conclude that rats on the group 1 diet show a statistically significantly different mean weight gain from the other groups.
2. Analyzing the BPRS dataset using linear mixed effect models
For this problem we’re creating a linear mixed effects model for the BPRS dataset.
BPRS is the brief psychiatric rating scale, used as an indicator of schizophrenia symptoms. The dataset consists of BPRS evaluations of 40 subjects, with one observation per subject per week over nine weeks (weeks 0-8). Half of the subjects were given one treatment, the other half another treatment.
As preprocessing, I will factorize the treatment and subject columns. Additionally, because the subject IDs are reused in both treatments, I will compute a unique subject ID from the 40 (subject, treatment) pairs instead.
bprs <- read_csv("data/bprs.csv") %>%
mutate(subject = factor(subject+(treatment-1)*20)) %>%
mutate(treatment = factor(treatment))
## Rows: 360 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): treatment, subject, bprs, week
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
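A quick sanity check of the remapping: adding (treatment - 1) * 20 to the reused IDs 1..20 should yield 40 distinct subjects. Sketch on stand-in vectors with the same structure as the bprs columns:

```r
# Sanity check of the unique-subject remapping used above:
# subject + (treatment - 1) * 20 maps the reused IDs 1..20 in each
# treatment arm onto 1..40 with no collisions. Stand-in vectors.
subject   <- rep(1:20, times = 2)
treatment <- rep(1:2, each = 20)
new_id <- subject + (treatment - 1) * 20
length(unique(new_id))  # 40
```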
2.1 Dataset visualization
Let’s visualize the data: first a summary of all measurements, then the mean of each group.
bprs_trend <- bprs %>%
group_by(week) %>%
summarise(mean_bprs = mean(bprs), se_bprs = sd(bprs)/sqrt(n()), var =var(bprs))
bprs_trend
## # A tibble: 9 × 4
## week mean_bprs se_bprs var
## <dbl> <dbl> <dbl> <dbl>
## 1 0 48 2.23 200.
## 2 1 46.3 2.58 267.
## 3 2 41.7 1.98 157.
## 4 3 39.2 1.97 155.
## 5 4 36.4 1.74 121.
## 6 5 32.6 1.46 85.5
## 7 6 31.2 1.61 104.
## 8 7 32.2 1.81 132.
## 9 8 31.4 1.94 150.
Individuals and overall mean and spread
ggplot(data=bprs_trend, aes(x=week, y=mean_bprs,ymin=mean_bprs-se_bprs, ymax=mean_bprs+se_bprs, linetype="mean +- se")) +
geom_ribbon(aes(alpha=0.1, fill="se"), show.legend = F)+geom_errorbar(width=0.1, size = 0.3) + geom_line() +
geom_line(aes(y=mean_bprs+sqrt(var), linetype="1 stdev")) + geom_line(aes(y=mean_bprs-sqrt(var), linetype="1 stdev")) +
geom_line(data=bprs, aes(y=bprs, ymin=0, ymax=0,color=treatment, group=subject)) +
labs(color= "Treatment", linetype="") +
ylab("BPRS") +
scale_fill_manual("", values =(c("se"="#cccccc"))) +
scale_linetype_manual(values =(c("1 stdev"="dashed", "mean +- se"="solid"))) +
scale_color_manual(values =(c("1"="#ff9999","2"="#9999ff"))) +
scale_alpha(guide="none")
## Warning in geom_line(data = bprs, aes(y = bprs, ymin = 0, ymax = 0, color =
## treatment, : Ignoring unknown aesthetics: ymin and ymax
fig. 5.1, subject bprs over time, colored by treatment
Per group means
bprs_treatment_trend <- bprs %>%
group_by(week, treatment) %>%
summarise(mean_bprs = mean(bprs), se_bprs = sd(bprs)/sqrt(n()), var =var(bprs))
## `summarise()` has grouped output by 'week'. You can override using the
## `.groups` argument.
ggplot(data=bprs_treatment_trend, aes(x=week, y=mean_bprs,ymin=mean_bprs-se_bprs, ymax=mean_bprs+se_bprs, linetype="mean +- se", color = treatment, group = treatment)) +
geom_ribbon(aes(alpha=0.1, fill=treatment), show.legend = F)+geom_errorbar(width=0.1, size = 0.3) + geom_line() +
geom_line(aes(y=mean_bprs+sqrt(var), linetype="1 stdev")) + geom_line(aes(y=mean_bprs-sqrt(var), linetype="1 stdev")) +
labs(color= "Treatment", linetype="") +
ylab("BPRS") +
scale_linetype_manual(values =(c("1 stdev"="dashed", "mean +- se"="solid"))) +
scale_color_manual(values =(c("1"="#ff9999","2"="#9999ff"))) +
scale_alpha(guide="none")
fig. 5.2, mean bprs per treatment group over time
We can clearly see that more of the variance in the dataset comes from measurements above the mean than from measurements below it (look at how far beyond one standard deviation the measurements extend on each side of the mean): the distribution is skewed. One outlier in particular distorts the distribution, and from the per-treatment plot we can deduce that it belongs to treatment group 2.
Overall, there doesn’t seem to be much difference between the two groups; the only potentially interesting feature in the plots is around weeks 6-7, where the slope of both groups seems to change.
2.2 Independent linear model
Let’s first fit a model that ignores the correlation of repeated measures taken on the same subject.
bprs.lm <- lm(bprs ~ week + treatment, bprs)
summary(bprs.lm)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = bprs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Based on the independent model, we find no statistically significant effect of the treatment. This doesn’t say much by itself, since the differences in intercepts and slopes between subjects are much larger (as seen in figure 5.1) than the differences between groups.
The model does, however, also estimate the overall slope over time, which is -2.27 BPRS points per week.
2.3 Random intercept model
bprs.lmri <- lmer(bprs ~ week + treatment + (1|subject), data = bprs, REML = FALSE)
summary(bprs.lmri)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
The variance due to the random intercept is:
\(\hat{\sigma}^2_u = 97.39\)
And the residual variance is:
\(\hat{\sigma}^2 = 54.23\)
Since \(\hat{\sigma}^2_u \gt \hat{\sigma}^2\), most of the total variance is between-subject variance: the random intercept captures a large share of the variability in the data.
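These two components also give the intraclass correlation, the share of total variance attributable to stable between-subject differences. A small sketch using the variance components reported above:

```r
# Intraclass correlation (ICC) implied by the random-intercept fit.
# Variance components copied from the summary(bprs.lmri) output above.
sigma2_u <- 97.39  # between-subject (intercept) variance
sigma2   <- 54.23  # residual variance
icc <- sigma2_u / (sigma2_u + sigma2)
round(icc, 2)  # 0.64: subjects account for about 64% of the variance
```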
bprs$fitted_ri <- fitted(bprs.lmri)
ri_observed_plot <- ggplot(data=bprs, aes(x=week, color=treatment, group=subject )) +
ggtitle("Observed values") +
geom_line(aes(y=bprs)) +
labs(color= "Treatment") +
ylab("BPRS") +
scale_fill_manual("", values =(c("se"="#cccccc"))) +
scale_linetype_manual(guide="none") +
scale_color_manual(values =(c("1"="#ff9999","2"="#9999ff"))) +
scale_alpha(guide="none")
ri_fit_plot <- ggplot(data=bprs, aes(x=week, color=treatment, group=subject )) +
ggtitle("Fitted values") +
geom_line(aes(y=fitted_ri)) +
labs(color= "Treatment") +
ylab("BPRS") +
scale_fill_manual("", values =(c("se"="#cccccc"))) +
scale_linetype_manual(guide="none") +
scale_color_manual(values =(c("1"="#ff9999","2"="#9999ff"))) +
scale_alpha(guide="none")
gridExtra::grid.arrange(ri_observed_plot, ri_fit_plot, ncol=2)
fig. 6.1, Model with random intercept
Confidence intervals
Let’s check the confidence intervals to see if there is a clear effect in either direction:
confint(bprs.lmri)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 7.918218 12.645375
## .sigma 6.828332 7.973573
## (Intercept) 41.744598 51.163179
## week -2.565919 -1.974914
## treatment2 -5.885196 7.029641
The confidence interval for the coefficient of treatment2 straddles 0, so there is no significant effect.
2.3 Random intercept + slope model
bprs.lmris <- lmer(bprs ~ week + treatment + (week|subject), data = bprs, REML = FALSE)
summary(bprs.lmris)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 167.827 12.955
## week 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.9830 2.6470 17.372
## week -2.2704 0.2713 -8.370
## treatment2 1.5139 3.1392 0.482
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.545
## treatment2 -0.593 0.000
So now we have the following random effects:
\(\hat{\sigma}^2_u = 167.827\)
\(\hat{\sigma}^2_v = 2.331\)
And the residual variance is:
\(\hat{\sigma}^2 = 36.747\)
As the residual variance is lower now, we already have a hint that this model fits better than the previous one.
Interestingly, the total variance is now much higher. Having discussed this with the course teachers, the intuitive reason is that adjusting the slopes per subject tends to spread the intercepts more, as the lines can now freely “rotate” into position.
bprs$fitted_ris <- fitted(bprs.lmris)
ris_observed_plot <- ggplot(data=bprs, aes(x=week, color=treatment, group=subject )) +
ggtitle("Observed values") +
geom_line(aes(y=bprs)) +
labs(color= "Treatment") +
ylab("BPRS") +
scale_fill_manual("", values =(c("se"="#cccccc"))) +
scale_linetype_manual(guide="none") +
scale_color_manual(values =(c("1"="#ff9999","2"="#9999ff"))) +
scale_alpha(guide="none")
ris_fit_plot <- ggplot(data=bprs, aes(x=week, color=treatment, group=subject )) +
ggtitle("Fitted values") +
geom_line(aes(y=fitted_ris)) +
labs(color= "Treatment") +
ylab("BPRS") +
scale_fill_manual("", values =(c("se"="#cccccc"))) +
scale_linetype_manual(guide="none") +
scale_color_manual(values =(c("1"="#ff9999","2"="#9999ff"))) +
scale_alpha(guide="none")
gridExtra::grid.arrange(ris_observed_plot, ris_fit_plot, ncol=2)
fig. 6.2, Model with random intercept and slope
2.4 ANOVA between random intercept and random intercept + slope
anova(bprs.lmris,bprs.lmri)
## Data: bprs
## Models:
## bprs.lmri: bprs ~ week + treatment + (1 | subject)
## bprs.lmris: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## bprs.lmri 5 2582.9 2602.3 -1286.5 2572.9
## bprs.lmris 7 2523.2 2550.4 -1254.6 2509.2 63.663 2 1.499e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The deviance and AIC have both dropped substantially, which tells us that adding the random slope gives a much better fit (AIC, the Akaike information criterion, estimates how much information the model loses, so lower is better). At p = 1.5e-14, there is strong evidence that each subject has its own characteristic slope.
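The chi-square statistic in the table is just the drop in deviance, so the test can be reproduced by hand (deviances copied from the outputs above; df = 2 for the extra slope variance and its correlation with the intercept):

```r
# Likelihood-ratio test behind anova(bprs.lmris, bprs.lmri), by hand.
# Deviances copied from the model summaries shown earlier.
dev_ri  <- 2572.9  # random intercept only
dev_ris <- 2509.2  # random intercept + slope
chisq <- dev_ri - dev_ris  # ~63.7, matching the anova() table
pchisq(chisq, df = 2, lower.tail = FALSE)  # ~1.5e-14
```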
Confidence intervals
Nevertheless, let’s also check the confidence intervals to double check:
confint(bprs.lmris)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 10.3102840 16.7099152
## .sig02 -0.8241817 -0.4128655
## .sig03 1.1562015 2.0277746
## .sigma 5.5925524 6.6009205
## (Intercept) 40.5849755 51.2468100
## week -2.8150999 -1.7257334
## treatment2 -4.9313699 7.9592153
The confidence interval for the coefficient of treatment2 straddles zero, which means that we cannot conclude that there is a difference in outcome due to the treatment using this model.
2.5 Random intercept + slope with interaction
bprs.lmris_int <- lmer(bprs ~ week + treatment + week * treatment + (week|subject) , data = bprs, REML = FALSE)
summary(bprs.lmris_int)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2523.5 2554.5 -1253.7 2507.5 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4747 -0.5256 -0.0866 0.4435 3.7884
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 164.204 12.814
## week 2.203 1.484 -0.66
## Residual 36.748 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.9840 16.047
## week -2.6283 0.3752 -7.006
## treatment2 -2.2911 4.2200 -0.543
## week:treatment2 0.7158 0.5306 1.349
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.668
## treatment2 -0.707 0.473
## wek:trtmnt2 0.473 -0.707 -0.668
The variances have barely changed, and the plot below also looks almost exactly the same as the previous model.
bprs$fitted_ris_int <- fitted(bprs.lmris_int)
int_observed_plot <- ggplot(data=bprs, aes(x=week, color=treatment, group=subject )) +
ggtitle("Observed values") +
geom_line(aes(y=bprs)) +
labs(color= "Treatment") +
ylab("BPRS") +
scale_fill_manual("", values =(c("se"="#cccccc"))) +
scale_linetype_manual(guide="none") +
scale_color_manual(values =(c("1"="#ff9999","2"="#9999ff"))) +
scale_alpha(guide="none")
int_fit_plot <- ggplot(data=bprs, aes(x=week, color=treatment, group=subject )) +
ggtitle("Fitted values") +
geom_line(aes(y=fitted_ris_int)) +
labs(color= "Treatment") +
ylab("BPRS") +
scale_fill_manual("", values =(c("se"="#cccccc"))) +
scale_linetype_manual(guide="none") +
scale_color_manual(values =(c("1"="#ff9999","2"="#9999ff"))) +
scale_alpha(guide="none")
gridExtra::grid.arrange(int_observed_plot, int_fit_plot, ncol=2)
fig. 6.3, Model with random intercept and slope, including interaction between week and treatment
2.6 ANOVA between rand. intercept + slope model with interaction and without
anova(bprs.lmris_int,bprs.lmris)
## Data: bprs
## Models:
## bprs.lmris: bprs ~ week + treatment + (week | subject)
## bprs.lmris_int: bprs ~ week + treatment + week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## bprs.lmris 7 2523.2 2550.4 -1254.6 2509.2
## bprs.lmris_int 8 2523.5 2554.6 -1253.7 2507.5 1.78 1 0.1821
We see a negligible drop in deviance and even a slight increase in AIC, with p = 0.18: there is no significant effect due to a week-treatment interaction.
Confidence intervals
Nevertheless, let’s check the confidence intervals to see if there is a clear effect from either the treatment or a treatment-time interaction:
confint(bprs.lmris_int)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 10.2191215 16.4882552
## .sig02 -0.8181446 -0.4021099
## .sig03 1.1186282 1.9764648
## .sigma 5.5925518 6.6009202
## (Intercept) 41.8936922 53.8774190
## week -3.3816817 -1.8749850
## treatment2 -10.7648856 6.1826634
## week:treatment2 -0.3495621 1.7812288
The confidence intervals for both the treatment2 and week:treatment2 fixed-effect coefficients straddle 0, which means we have to conclude that there is no significant evidence of a treatment effect in the data, even using a mixed effects model with an interaction.
2.7 Interpretation
We have to conclude that, using a linear mixed effects model, we cannot find a significant effect due to the treatment.
Social equality trend line
It is worth pointing out that there is a clear trend line in the upper left corner of the biplot, running from the bottom right to the top left, and most of the countries fall on it. The remaining countries are widely spread, which is why this trend is not aligned with the principal components. Let’s call it the social equality trend line, as the countries at its top all have social welfare programs such as universal healthcare, strong unions, and low Gini indices. (A bit hand-wavy, but this is a course assignment, not a research paper.)